Motif Orientation and Characterization for High-throughput Assays (MOCHA)

Description:

For enzymes that act on protein and peptide substrates, this program takes profiling data for a given enzyme and identifies the Motif, or recognition site, within the larger protein sequence. The Motif consists of the substrate positions whose Entropy score (∆S) is ≥ a user-defined minimum ∆S value.

∆S is evaluated at each position in the substrate sequence and is found by calculating the difference between the Maximum Entropy (S_Max) and the Shannon Entropy (S_Shannon).

∆S = S_Max - S_Shannon = log₂(20) - ∑(-prob_AA × log₂(prob_AA))
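
The formula above can be sketched in Python as follows. This is a minimal illustration, not the MOCHA source code: the function name `delta_entropy` and the parameter `alphabet_size` are hypothetical, and probabilities are assumed to be simple per-position residue frequencies.

```python
import math
from collections import Counter


def delta_entropy(sequences, alphabet_size=20):
    """Return delta S (in bits) for each position of equal-length sequences."""
    if not sequences:
        raise ValueError("no sequences provided")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    s_max = math.log2(alphabet_size)  # maximum entropy, log2(20) for amino acids
    total = len(sequences)
    scores = []
    for pos in range(length):
        # Frequency of each residue observed at this position
        counts = Counter(seq[pos] for seq in sequences)
        # Shannon entropy: sum of -p * log2(p) over observed residues
        s_shannon = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(s_max - s_shannon)
    return scores


# Position 1 is fully conserved, position 2 has two residues, position 3 has four
scores = delta_entropy(["AKR", "AKD", "ALE", "ALG"])
print([round(s, 2) for s in scores])  # [4.32, 3.32, 2.32]
```

A fully conserved position reaches the maximum ∆S of log₂(20) ≈ 4.32 bits, and ∆S falls as the residues at a position become more varied.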

Once the Motif has been identified, the substrates are binned by selecting only the recognition sequence. In other words, the parts of the sequence that are not important for an Enzyme-Substrate interaction are removed.
- The non-important positions are identified by meeting the condition: ∆S < minimum ∆S
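
As a rough sketch of this binning step, assuming the per-position ∆S values are already available as a list, the important positions can be selected and every substrate trimmed to them. The names `extract_motifs`, `delta_s`, and `min_delta_s` are hypothetical and are not taken from the MOCHA code.

```python
def extract_motifs(sequences, delta_s, min_delta_s):
    """Trim each sequence to the positions where delta S >= min_delta_s."""
    # Indices of the positions that make up the Motif
    keep = [i for i, score in enumerate(delta_s) if score >= min_delta_s]
    # Drop every position that falls below the threshold
    motifs = ["".join(seq[i] for i in keep) for seq in sequences]
    return keep, motifs


positions, motifs = extract_motifs(
    ["AKR", "AKD", "ALE", "ALG"],
    delta_s=[4.32, 3.32, 2.32],
    min_delta_s=3.0,
)
print(positions)  # [0, 1]
print(motifs)     # ['AK', 'AK', 'AL', 'AL']
```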

The program will then count the occurrences of each Motif and collect the "N" most frequent sequences. This subset is used to make Bar Graphs and a Word Cloud. These figures display the most common combinations of active substrates within your dataset.
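
The counting step might look like the following sketch, which uses the standard library `collections.Counter`. The function name `top_motifs` is hypothetical, and the plotting of the Bar Graphs and Word Cloud is not shown.

```python
from collections import Counter


def top_motifs(motifs, n):
    """Return the n most frequent motifs as (motif, count) pairs."""
    return Counter(motifs).most_common(n)


ranked = top_motifs(["AK", "AK", "AL", "AK", "AL", "VL"], n=2)
print(ranked)  # [('AK', 3), ('AL', 2)]
```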

We can further evaluate the top Motifs by feeding the data into a Suffix Tree. This analysis will select the specific residues within the Motif, and plot the amino acids as nodes, with lines connecting the observed combinations of residues across the Motif. The figure reveals the unique preferences for a given amino acid when another is present in a preferred substrate.
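
To illustrate the idea of residues as nodes linked by observed combinations, the sketch below builds a simple prefix tree (trie) from counted Motifs. This is a simplified stand-in under stated assumptions: it is not necessarily how MOCHA constructs its Suffix Tree, and the function name `build_trie` and the node layout are hypothetical.

```python
def build_trie(motif_counts):
    """Build a nested dict where each node maps residue -> count and children.

    motif_counts is an iterable of (motif, count) pairs.
    """
    root = {}
    for motif, count in motif_counts:
        level = root
        for residue in motif:
            # Create the node on first sight, then add this motif's count
            node = level.setdefault(residue, {"count": 0, "next": {}})
            node["count"] += count
            level = node["next"]
    return root


trie = build_trie([("AK", 3), ("AL", 2)])
# 'A' at the first motif position is followed by either 'K' or 'L'
print(trie["A"]["count"])               # 5
print(sorted(trie["A"]["next"].keys())) # ['K', 'L']
```

Each path from the root to a leaf is one observed Motif, so the children of a node show which residues were seen together with the residues before it.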

Running The Program:

This website is deployed on railway.com using the free trial service, which provides a limited amount of RAM.
- This limits the website's ability to evaluate your data, which may result in an error message after submitting a job.
- For optimal performance, clone the MOCHA repository and run the program in a Python IDE.
- Source Code is available at: github.com/Collinformatics/MOCHA

Requirements: Input data
- All protein sequences must have the same sequence length.
- An uploaded file can only contain one sequence per line.

Uploaded files must be formatted as .txt
- An example of a correctly formatted text file is provided on GitHub.
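
Before uploading, the requirements above can be checked locally with a short script. This is a minimal sketch, not part of MOCHA: the function name `parse_sequences` is hypothetical, and it assumes blank lines can be ignored and that whitespace inside a line indicates more than one sequence.

```python
def parse_sequences(text):
    """Validate the contents of a .txt file and return the list of sequences."""
    # Ignore blank lines and surrounding whitespace
    sequences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sequences:
        raise ValueError("the file contains no sequences")
    for seq in sequences:
        # Whitespace inside a line suggests more than one sequence on it
        if len(seq.split()) != 1:
            raise ValueError(f"more than one sequence on a line: {seq!r}")
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    return sequences


print(parse_sequences("AKR\nAKD\nALE\n"))  # ['AKR', 'AKD', 'ALE']
```

To check a file on disk, read it first, for example with `open(path).read()`, and pass the resulting text to the function.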

In order to accurately identify a Motif, it is recommended to apply a filter to your data before uploading the file.
- Example: Use a set of known substrates that contain a preferred residue at a given position in the sequence.

For a guided explanation of how the analysis works, press the "Evaluate" button without uploading a Text file. This will run the analysis on a template dataset.
- The evaluation will take a few seconds to complete.
- Once finished, the results will be displayed at the bottom of the page.

Citations:

1) Crooks, G. E., et al. Genome Research. (2004)

2) Lesne, Annick. Mathematical Structures in Computer Science. (2014)