Documentation – Development

Introduction

Proteomatic scripts may be implemented in one of the supported scripting languages:

  • Ruby
  • Python
  • PHP

Support for more scripting languages can be added, but operating-system-independent, lightweight languages are strongly preferred to retain overall platform independence and ease of use.

Script types

In an attempt to unify the usage of scripts, two usage patterns have been defined. Each self-describing script must conform to either the processor or converter pattern.

Processor scripts

Processor scripts may define an arbitrary number of input file groups, each with its own minimum/maximum file count constraints. Each input file group is defined via one or more file formats, which are in turn defined via filename extensions. Input files are assigned to the correct group automatically as long as the filename extensions are unambiguous among the groups. Otherwise, filenames must be preceded by the appropriate group switch on the command line, as detailed in the help message obtained with --help. In the GUI, additional input group proxy boxes are created to which input files must be connected.
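
The automatic assignment can be pictured with a short sketch. This is not Proteomatic's actual implementation: the function name assign_to_groups and the dictionary layout are assumptions made for illustration. A file is assigned when exactly one group accepts its extension; everything else is reported back so that the caller can ask for an explicit group switch.

```python
import os

def assign_to_groups(filenames, group_extensions):
    """Assign each file to the single input group that accepts its extension.

    group_extensions maps a group key to a list of extensions (hypothetical layout).
    Returns (assigned, unassigned): assigned maps group keys to file lists,
    unassigned lists files that match no group or more than one group.
    """
    assigned = {key: [] for key in group_extensions}
    unassigned = []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        matches = [key for key, exts in group_extensions.items() if ext in exts]
        if len(matches) == 1:
            assigned[matches[0]].append(name)
        else:
            unassigned.append(name)
    return assigned, unassigned

groups = {'databases': ['.fasta', '.fa'], 'spectra': ['.mgf']}
assigned, unassigned = assign_to_groups(['human.fasta', 'run1.mgf', 'notes.txt'], groups)
print(assigned)
print(unassigned)
```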

The processor usage pattern allows for an arbitrary number of input files, but there is a fixed set of output files, each of which can be enabled or disabled. Note that both input files and output files are optional in this usage pattern.

Converter scripts

The converter usage pattern allows for an arbitrary number of input file groups. One input group is defined as the converter input group and every file in this group gets converted. This means that the number of output files is equal to the number of input files in the converter input group. Additional input files may be defined in other input groups; these can be used for the conversion but will not be converted themselves.

Implementing scripts

Every Proteomatic script consists of two distinct parts:

  • a description containing detailed information about parameters and input and output files
  • the actual implementation, in any of the supported scripting languages

The description is implicitly linked to the implementation via its filename and is independent of the scripting language used.

The script description

The script description is a YAML-formatted document and must be placed in the include/properties directory. Using a ‘tidy up’ script as an example, the script description would be saved in include/properties/tidy-up.yaml (the purpose of this script is purely demonstration):

title: Tidy up

group: Miscellaneous/Word processing

type: processor

description: >
  Tidy up a string by sorting all letters 
  and placing them neatly next to each other.
  
parameters:
  - key: text
    label: Text
    type: string
    default: 'Hello, fellow!'

Please note that tabs are not allowed in YAML files. Apart from that the format is really easy to learn. The individual keys defined above are:

title

The script title should be short and descriptive. It is displayed to the user in the GUI and on the website.

group

Use the group to specify a menu path in the scripts menu. In this example, a top-level menu ‘Miscellaneous’ would be created, and in it, a sub-menu called ‘Word processing’, which would finally contain the script.
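
As an illustration of how a group string maps to a menu path, here is a minimal sketch that builds a nested dictionary from group values. The function name add_to_menu and the '_scripts' bookkeeping key are made up for this example and say nothing about how the GUI stores its menus.

```python
def add_to_menu(menu, group, title):
    """Insert a script title into a nested menu dictionary along the group path."""
    node = menu
    for part in group.split('/'):
        # Create the (sub-)menu on first use, then descend into it.
        node = node.setdefault(part, {})
    node.setdefault('_scripts', []).append(title)
    return menu

menu = {}
add_to_menu(menu, 'Miscellaneous/Word processing', 'Tidy up')
add_to_menu(menu, 'Miscellaneous', 'Check MD5 sums')
print(menu)
```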

type

One of the two script types, processor or converter.

description

Describe what the script does in a very concise way here. Start with an active verb like ‘Tidy up’. Do not say ‘Tidies up…’ or ‘This script tidies up…’.
If you have more than one sentence, place the first sentence and each following paragraph in <p> … </p> tags, but make sure that the description does not become too long. For additional documentation, you can use a documentation file (which in this example would be called include/documentation/tidy-up.html).

parameters

If your script has parameters, place them here, in the order you want them to appear.

key

The parameter key is used to set the parameter on the command-line level. For example, the text parameter would be specified like this: -text "Hello, fellow". Use lower camel case for parameter keys, for example parseMode.
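
The naming convention and the command-line form can be checked with a small sketch. The helper names are hypothetical, the regular expression is only one possible reading of "lower camel case", and only simple key/value parameters are covered.

```python
import re
import shlex

# One possible reading of lower camel case: starts with a lowercase letter,
# followed by letters and digits only (no underscores or dashes).
LOWER_CAMEL_CASE = re.compile(r'^[a-z][a-zA-Z0-9]*$')

def is_lower_camel_case(key):
    return bool(LOWER_CAMEL_CASE.match(key))

def command_line_for(script, params):
    """Render a shell command that sets each parameter as -key value."""
    parts = [script]
    for key, value in params.items():
        parts += ['-' + key, str(value)]
    return ' '.join(shlex.quote(part) for part in parts)

print(is_lower_camel_case('parseMode'))
print(is_lower_camel_case('parse_mode'))
print(command_line_for('./tidy-up.rb', {'text': 'Hello, fellow'}))
```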

label

The label represents the parameter in a human-readable way. Use proper capitalization here.

type

The following parameter types are available:

  • string: A character string.
    This type has specific options:
    • textfield: yes — show a text field in the GUI, allowing the user to enter more than a single line of text.
    • examples: [string list] — provides examples to the user via an editable drop-down box in the GUI.
  • int: An integer.
    This type has specific options:
    • min: [number] — the smallest number allowed
    • max: [number] — the highest number allowed
    • suffix: [string] — a suffix which is displayed after the number
  • float: A floating point number.
    This type has specific options:
    • min: [number] — the smallest number allowed
    • max: [number] — the highest number allowed
    • suffix: [string] — a suffix which is displayed after the number
    • decimals: [number] — the number of decimals displayed in the GUI
  • enum: Creates a drop-down box in the GUI, presenting multiple choices to the user.
    Requires the definition of the possible values:
    • choices: [list of choices] — each entry in the list can either be a string (in which case key and label are equal) or a dictionary with a single key/value entry, where the value is used as the label.
      Example:
      parameters:
        - key: loop
          label: Number of loops
          type: enum
          choices: ['one', 'many', {'infinite': 'Infinite - this is insane!'}]
      
      In this example, the parameters would be specified with single words on the command line but the GUI would present the verbose representation of the ‘infinite’ choice to the user.
  • csvString: The same as enum, but with multiple choices possible (including no choice at all). The chosen parameters are specified as a string composed of comma-separated keys.
  • bool: A boolean value (true or false). You may also use flag as a synonym.

default

If you don’t specify a default value for your parameter, sensible default values will be chosen automatically (empty strings and zeros, false for bools/flags, the first choice for enums, no choice for csvStrings).
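
The fallback rules can be summarized in a short sketch. It operates on plain dictionaries shaped like the YAML entries; the function names are hypothetical, and representing "no choice" for a csvString as an empty string is an assumption.

```python
def choice_keys(choices):
    """Return the keys of an enum/csvString choice list.

    Each entry is either a plain string or a single-entry dict {key: label}.
    """
    keys = []
    for entry in choices:
        if isinstance(entry, dict):
            keys.append(next(iter(entry)))
        else:
            keys.append(entry)
    return keys

def default_value(parameter):
    """Pick the default for a parameter dictionary, following the rules above."""
    if 'default' in parameter:
        return parameter['default']
    kind = parameter['type']
    if kind in ('string', 'csvString'):
        return ''  # assumption: "no choice" is an empty key list
    if kind == 'int':
        return 0
    if kind == 'float':
        return 0.0
    if kind in ('bool', 'flag'):
        return False
    if kind == 'enum':
        return choice_keys(parameter['choices'])[0]
    raise ValueError('unknown parameter type: ' + kind)

loop = {'key': 'loop', 'type': 'enum',
        'choices': ['one', 'many', {'infinite': 'Infinite - this is insane!'}]}
print(choice_keys(loop['choices']))
print(default_value(loop))
print(default_value({'key': 'mc', 'type': 'int'}))
```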

In addition, the following keys are possible for parameters:

description

Describe the parameter if it is not obvious. This description will be shown on the website and, in the GUI, in a tool tip when the user hovers over the parameter.

Input files

In addition to parameters, input files may be specified. An example of this is the Check MD5 sums script which takes a number of input files and creates an overall MD5 checksum. This is the description (include/properties/md5sum.yaml):

group: Miscellaneous

title: Check MD5 sums

description: >
  <p>
  Determine the MD5 sum of a set of input files.
  </p>
  <p>
  All input files are sorted by filename (regardless of their directory) and 
  concatenated for the MD5 creation. This is done to ensure equal MD5 checksums 
  for equal sets of input files, regardless of the order the files are specified in.
  </p>

type: processor

input:
  - key: files
    label: input
    min: 1
    formats: [any]

parameters:
  - key: assertMd5
    type: string
    default: ''
    label: Expected MD5
    description: >
      If you expect a certain MD5 sum, enter it here. If the determined checksum
      is different from the expected result, the script will exit with an error.

The input key contains a list of input groups. Each group has a key and a label, and min/max file count requirements may be specified. Most importantly, the possible file formats are specified.
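
The checksum behaviour stated in the description text of this example (sort by filename regardless of directory, then concatenate) could be sketched as follows. This is an independent illustration using Python's hashlib, not the code of the actual script; the function name combined_md5 is made up.

```python
import hashlib
import os
import tempfile

def combined_md5(paths):
    """Hash the contents of all files, ordered by filename without directory."""
    digest = hashlib.md5()
    for path in sorted(paths, key=os.path.basename):
        with open(path, 'rb') as handle:
            # Read in blocks so that large files do not have to fit in memory.
            for block in iter(lambda: handle.read(65536), b''):
                digest.update(block)
    return digest.hexdigest()

# Demonstration with two throwaway files in different directories.
root = tempfile.mkdtemp()
first = os.path.join(root, 'x', 'b.txt')
second = os.path.join(root, 'y', 'a.txt')
for path, content in ((first, b'world'), (second, b'hello')):
    os.makedirs(os.path.dirname(path))
    with open(path, 'wb') as handle:
        handle.write(content)

# The order of the arguments does not change the result.
print(combined_md5([first, second]) == combined_md5([second, first]))
```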

File formats are specified in the CLI tools atlas, in include/cli-tools-atlas/formats. Here are a couple of file formats:

  • any.yaml
  • csv.yaml
  • fasta.yaml
  • html.yaml
  • mgf.yaml
  • txt.yaml
  • xhtml.yaml
  • xml-mzml.yaml
  • xml-mzml-compressed.yaml

File formats are specified via their key, as defined by the filename. In order to use CSV files as input files, csv must be specified. Let’s look into csv.yaml:

extensions: ['.csv']
description: Comma separated values

Several filename extensions are possible for a file format. This information is used by Proteomatic to assign input files to input groups based on their filename extension. If you need to add file formats, just add them to your local directory and they will be available in Proteomatic. If you would like to share your contribution, please add your work to the CLI tools atlas.

For the label, use lowercase words except where capitalization is required: ‘input’, ‘database’, ‘entries’, ‘FASTA database’ are good examples. In some places, Proteomatic will capitalize the first letter of the label, and in others the label is used within a sentence.

Output files

If you want your script to produce output files, put the output key into the description:

group: Proteomics/Miscellaneous

title: Tryptic protein digestion

description: >
  Digest a protein into tryptic peptides,
  allowing for missed cleavages.

type: processor

input:
  - key: sequences
    formats: [fasta]
    label: FASTA database

output:
  - key: results
    format: txt
    filename: tryptic-peptides.txt
    label: tryptic peptides

defaultOutputDirectory: sequences

proposePrefix:
  - sequences

parameters:
  - key: protein
    type: string
    label: Protein
    textfield: yes
  - key: mc
    type: int
    label: Missed cleavage sites
    min: 0
    default: 1
  - key: minLength
    type: int
    label: Minimum peptide length
    min: 0
    default: 0

For the label, the same rules apply as to input files: use lowercase words except where capitalization is required (e.g. FASTA).

Additional keys regarding output files:

defaultOutputDirectory

If your script provides output files, you must set defaultOutputDirectory to one of your input file groups. If the user does not explicitly specify an output directory, all output files are written to the directory of one of the user-provided input files from the specified group.

proposePrefix

The filename of output files is fixed but can be prepended with a user-defined prefix.
Proteomatic offers a ‘magic wand’ feature for proposing prefixes based on the input files specified.
By default, the input files of the group defined by defaultOutputDirectory are used to propose a prefix. If you want to change this, you can explicitly specify a list of input groups which should be used for prefix proposals.
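
How the output directory and the prefix combine into a complete output path can be sketched like this. The function name and signature are hypothetical, and taking the first file of the defaultOutputDirectory group is an assumption; the text only says that one of the files of that group is used.

```python
import os

def output_path(filename, prefix, group_files, output_directory=None):
    """Build the full path of an output file.

    Falls back to the directory of the first file in the
    defaultOutputDirectory group when no directory is given explicitly
    (using the first file is an assumption of this sketch).
    """
    if output_directory is None:
        output_directory = os.path.dirname(group_files[0])
    # The fixed output filename is kept; only the prefix is put in front of it.
    return os.path.join(output_directory, prefix + filename)

print(output_path('tryptic-peptides.txt', 'human-', ['/data/db/human.fasta']))
print(output_path('tryptic-peptides.txt', '', ['/data/db/human.fasta'], '/tmp/results'))
```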

Converter script output files

With converter scripts, output files must be specified in a different way because for every input file of one group, an output file is created. This means that the number of output files is not fixed as it is for processor scripts, but is defined at run time depending on the number of input files provided by the user.

Here is an example:

group: Proteomics/Protein databases

title: Chop FASTA databases

description: >
  Chop FASTA databases into chunks.

type: converter

input:
  - key: databases
    label: FASTA database
    formats: [fasta]

output:
  - key: databases
    label: chopped database
    format: fasta
    filename: "#{basename}-chopped.fasta"
    default: yes
        
defaultOutputDirectory: databases

parameters:
  - key: windowSize
    type: int
    default: 400
    min: 1
    label: Chunk size
  - key: windowShift
    type: float
    default: 50.0
    suffix: '%'
    decimals: 2
    label: Chunk overlap
    min: 0.0
    max: 100.0

The task of this script is to convert an input FASTA database to a FASTA database which contains fixed-length overlapping subsequences of the input sequences.
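
To make "fixed-length overlapping subsequences" concrete, here is a minimal sketch for a single sequence. It is not the code of the actual script. It assumes that the percentage parameter describes how much consecutive chunks overlap, and it drops trailing residues that do not fill a whole chunk; the real script may handle both differently.

```python
def chop(sequence, window_size, overlap_percent):
    """Cut a sequence into fixed-length windows that overlap by a percentage."""
    overlap = int(window_size * overlap_percent / 100.0)
    # Always advance by at least one position, even at 100 % overlap.
    step = max(1, window_size - overlap)
    last_start = len(sequence) - window_size
    # Sequences shorter than the window yield no chunk in this sketch.
    return [sequence[start:start + window_size]
            for start in range(0, last_start + 1, step)]

print(chop('ABCDEFGH', 4, 50.0))
```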

The main difference here is that only one output file group is specified and its key is set to the key of the corresponding input file group (the files which should be converted), in this case databases. Because several output files are created, the output filename includes parts of the respective input filename via the #{ } syntax (the Ruby lilypad). The following specifiers are supported:

  • #{filename} — the complete input filename (‘input.txt’)
  • #{basename} — the input filename up to the first dot (‘input’)
  • #{extension} — the input filename extension (‘.txt’)

Another way to specify the output filename in the above example would be #{basename}-chopped#{extension}. This would not set the output filename extension to .fasta but would retain the input filename extension, which could also be .fas, .fa, or .faa.
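
The specifier substitution can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that the extension is everything from the first dot onwards, so that basename and extension together give the complete filename.

```python
def expand_output_filename(template, input_filename):
    """Replace #{filename}, #{basename} and #{extension} in an output filename template."""
    dot = input_filename.find('.')
    if dot == -1:
        basename, extension = input_filename, ''
    else:
        # basename: up to the first dot; extension: the rest, including the dot
        basename, extension = input_filename[:dot], input_filename[dot:]
    replacements = {
        '#{filename}': input_filename,
        '#{basename}': basename,
        '#{extension}': extension,
    }
    for specifier, value in replacements.items():
        template = template.replace(specifier, value)
    return template

print(expand_output_filename('#{basename}-chopped.fasta', 'input.txt'))
```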

Implementing the script code

Create a file in the Proteomatic scripts base directory with the same filename as the description file, but ending in .rb, .py, or .php instead of .yaml, and make it executable with chmod +x. Don’t forget to add a shebang like #!/usr/bin/env ruby as the first line.

The basic principle of implementing a Proteomatic script is defining a subclass of ProteomaticScript and implementing the virtual run() method. In this method, three instance variables input, output and param are made available by the underlying framework. Finally, an object called script is instantiated from the class, and everything else is handled by the superclass constructor.

All user-provided information is passed, depending on the language used, in three variables:

param

A dictionary of parameters with keys as specified in the YAML description and their respective values.

input

A dictionary in which input group keys point to a list of input files, which may be empty, depending on the min/max requirements defined in the YAML description.

output

A dictionary in which output group keys point to the complete output filename, including the directory and prefix. If an output file was not requested, the key will not be defined.

Behind the scenes, output filenames are suffixed with .proteomatic.part, and Proteomatic takes care of renaming output files upon successful completion of the script. This way, broken files will not be considered as valid output files by mistake because they have the suffix.
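
Scripts do not have to do this themselves, but the idea behind the suffix is easy to show. The following sketch of the write-then-rename pattern uses a hypothetical helper write_output and is not taken from the framework.

```python
import os
import tempfile

PART_SUFFIX = '.proteomatic.part'

def write_output(final_path, write):
    """Write to a temporary .proteomatic.part file and rename it on success."""
    part_path = final_path + PART_SUFFIX
    with open(part_path, 'w') as handle:
        write(handle)
    # Only reached if writing raised no exception; a failed run leaves
    # the file under its .proteomatic.part name.
    os.replace(part_path, final_path)

target = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_output(target, lambda handle: handle.write('ACGT\n'))
print(os.path.exists(target), os.path.exists(target + PART_SUFFIX))
```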

Converter scripts: For converter scripts, the output dictionary contains an output filename for every input filename, for example: {'a1.fasta' => 'a1-converted.fasta', 'b2.fasta' => 'b2-converted.fasta'}.

The following examples are different implementations of the same script (Transpose DNA).

Ruby

In Ruby, all necessary information is passed via the instance variables @param, @input and @output. For keys, Ruby symbols are used instead of strings.

#! /usr/bin/env ruby

require './include/ruby/proteomatic'

class TransposeDna < ProteomaticScript
    def run()
        # convert all characters to upper case
        dna = @param[:nucleotides].upcase
        # remove invalid characters
        dna.gsub!(/[^CGAT]/, '')
        # reverse sequence
        dna.reverse!
        # replace nucleotides with their complements
        dna.tr!('ACGT', 'TGCA')
        # output transposed DNA
        puts dna
        File::open(@output[:result], 'w') { |f| f.puts dna } if @output[:result]
    end
end

script = TransposeDna.new

Python

In Python, all necessary information is passed via the instance variables self.param, self.input and self.output.

#! /usr/bin/env python

import sys, os
sys.path.append('./include/python')
import proteomatic
import string
import re

class TransposeDna(proteomatic.ProteomaticScript):
    def run(self):
        # convert all characters to upper case
        # Attention: parameters are Unicode because of the JSON parser 
        # used behind the scenes, convert nucleotides to ASCII string
        dna = str(self.param['nucleotides']).upper()
        # remove invalid characters
        dna = re.sub('[^ACGT]', '', dna)
        # reverse sequence
        dna = dna[::-1]
        # replace nucleotides
        dna = dna.translate(string.maketrans('ACGT', 'TGCA'))
        # output transposed DNA
        print(dna)
        
        if 'result' in self.output:
            with open(self.output['result'], 'w') as f:
                f.write(dna + "\n")
                
if __name__ == '__main__':
    script = TransposeDna()
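
The listing above relies on string.maketrans, which exists only in Python 2. As a sketch that is independent of the Proteomatic framework, the core transformation could be written for Python 3 as follows; the function name transpose_dna is made up for this example.

```python
def transpose_dna(nucleotides):
    """Upper-case, drop invalid characters, reverse and complement a DNA string."""
    dna = ''.join(base for base in nucleotides.upper() if base in 'ACGT')
    # In Python 3, maketrans is a static method of str.
    return dna[::-1].translate(str.maketrans('ACGT', 'TGCA'))

print(transpose_dna('atgcx'))
```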

PHP

In PHP, all necessary information is passed via the instance variables $this->param, $this->input and $this->output.

#! /usr/bin/env php
<?php

require_once ('include/php/proteomatic.php');

class TransposeDna extends ProteomaticScript
{
    function run()
    {
        // convert all characters to upper case
        $dna = strtoupper($this->param->nucleotides);
        // remove invalid characters
        $dna = preg_replace('/[^CGAT]/', '', $dna);
        // reverse sequence
        $dna = strrev($dna);
        // replace nucleotides
        $dna = strtr($dna, 'ACGT', 'TGCA');
        // output transposed DNA
        print($dna."\n");
        if (isset($this->output->result))
        {
            $f = fopen($this->output->result, 'w');
            fwrite($f, $dna."\n");
            fclose($f);
        }
    }
}

$script = new TransposeDna();
?>

Development tips

Configuration file

Before your scripts show up in the Proteomatic scripts menu, you have to tell Proteomatic where to look for your scripts. This information is stored in the configuration file proteomatic.conf.yaml. This file is stored in the Proteomatic data directory which is usually ~/.proteomatic, or, if you use a portable version, the program’s directory.

To determine the actual data directory, open the Preferences dialog (Ctrl+P), where the data directory is shown in blue.

In the configuration file, you may specify additional scripts paths via the additionalScriptsPaths variable:

additionalScriptsPaths:
  - /home/michael/my-scripts/proteomatic-scripts

If you wish to move the managed scripts packages out of the way, you may rename or remove the scripts directory in the data directory. In addition, disable automatic script updates to prevent the managed scripts from re-appearing. Re-enabling the updates at a later time will re-create the scripts directory and everything will work as before.

Reloading scripts

If your script does not appear in the scripts menu or if you would like it to appear in another submenu, press Ctrl+R to reload the scripts menu. Because Proteomatic caches information to gain speed, it is usually necessary to also touch the script file when only the description has been changed (the modification timestamp of the script file is used to determine whether the cached information needs to be updated).
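
On systems without a touch command, the modification timestamp can be updated from Python. This is a minimal sketch; the temporary file merely stands in for a real script file.

```python
import os
import tempfile

def touch(path):
    """Set the file's access and modification time to now, like the touch command."""
    os.utime(path, None)

# Stand-in for a script file such as tidy-up.rb.
handle, script_path = tempfile.mkstemp(suffix='.rb')
os.close(handle)
os.utime(script_path, (0, 0))  # pretend the file is very old
touch(script_path)
print(os.path.getmtime(script_path) > 0)
```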

If your script is already depicted as a script box on the canvas, changes to the code take immediate effect because the script is simply called by the GUI. However, in order to reflect parameter changes which affect the GUI, it is necessary to reload the script from the scripts menu.

Troubleshooting

If your script does not appear in the scripts menu, run it from the command line. When Proteomatic builds the script menu (press Ctrl+R to force a rebuild), it calls every script with ---yamlInfo --short, unless the information in the cache is still valid. Try to run your script with ---yamlInfo --short from the command line and see what happens.

When a script is picked from the scripts menu, Proteomatic calls the script with ---yamlInfo to receive all information, including parameters.

If there is information missing in the script’s YAML description, the script will exit with an error and Proteomatic will silently skip the script. When called from the command line, the output messages show what went wrong.
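
The same check can be scripted. The sketch below builds the call described above and captures the output for inspection; the helper names and the script path ./tidy-up.rb are placeholders, and only the two switches are taken from the text.

```python
import subprocess

def yaml_info_command(script_path, short=True):
    """Build the argument list used to query a script for its description."""
    command = [script_path, '---yamlInfo']
    if short:
        command.append('--short')
    return command

def query_script(script_path, short=True):
    """Run the script and return (exit code, stdout, stderr) for inspection."""
    result = subprocess.run(yaml_info_command(script_path, short),
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr

print(yaml_info_command('./tidy-up.rb'))
```

A non-zero exit code or text on stderr from query_script would point to the kind of problem that makes Proteomatic skip the script silently.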