Workflow 1 - extract features from the trajectory

Input: reference PDB and trajectory

Output: featurized trajectory

Steps:

Load reference PDB and trajectory in the EnGen object
Provide set of featurizations of interest (or use default)
Evaluate different featurization (optional)
Choose the best featurization
Extract those features

[1]:

# required imports
from engens.core.EnGens import *
import engens.core.FeatureSelector as fs
import pickle
import mdshare
import mdtraj
import numpy as np
import nglview
from IPython.display import Javascript, display
import json

Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Matplotlib is building the font cache; this may take a moment.

Step 1 - load the structure and trajectory

Provide the path to the files with the reference trajectory and topology. (You can use any format that mdtraj.load will take as input).

Optionally, provide a subset of the structure that you will use for featurization (e.g. binding site) as a atom selection string or a list of atom indices.

[2]:

# If you do not have trajectories of your own you can fetch this pentapeptide example
# or any other example from mdshare platform: https://markovmodel.github.io/mdshare/

pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='.')
files = mdshare.fetch('pentapeptide-00-500ns-impl-solv.xtc', working_directory='.')

[3]:

# Location of trajectory and topology files
# (you can specify your own files here)
top_loc = 'pentapeptide-impl-solv.pdb'
traj_loc = 'pentapeptide-00-500ns-impl-solv.xtc'

#Instrantiate the engens object

#------------------------four options----------------------------#
#-----you can uncomment any option that you would like to try-----#

#Option 1 - load the full trajectory as a EnGen object

#load full trajectory

engen = EnGen(traj_loc, top_loc)


# you can align the input trajectory by specifying the align flag to be true:
# engen = EnGen(traj_loc, top_loc, align = True)



#Option 2 - load a part of the trajectory by specifying a list of atoms of interest (binding site)
#for example - first 10 atoms of the structure
'''
#load first 10 atoms of the trajectory
N = 10
select_list = [i for i in range(N)]
engen = EnGen(traj_loc, top_loc, select_list)
'''

#Option 3 - load a part of the trajectory by specifying a selection string (binding site)
#using atom selection format https://mdtraj.org/1.9.4/atom_selection.html
#for example - you can specify residues with id 1, 2, 3
'''
#load residues 1,2,3 of the trajectory
binding_site_selstr = "resid 1 or resid 2 or resid 3"
engen = EnGen(traj_loc, top_loc, binding_site_selstr)
'''

#Option 4 - load a part of the trajectory by selecting residues/atoms interactively
# press CTRL and hover over the residues of interest to select them
# press SHIFT and hover over residues to unselect them
# to delete the selection input text "none"
# into the filter for selection on the right side of the menu
# nglwidget = select_residues_nglview(top_loc)
# nglwidget

[3]:

'\n#load residues 1,2,3 of the trajectory\nbinding_site_selstr = "resid 1 or resid 2 or resid 3"\nengen = EnGen(traj_loc, top_loc, binding_site_selstr)\n'

[4]:

## Option 4 - continue selection 1
selection = None
display(Javascript(js_script))

[5]:

## Option 4 - continue selection 2
#binding_site_selstr = get_selstring(selection)
#binding_site_selstr = "(10 <= resid) and (resid <= 50)"
#engen = EnGen(traj_loc, top_loc, binding_site_selstr, align = True)

#------------------------end of options----------------------------#

[6]:

#visualize the trajectory (optional - if trajectory too large, skip this step)
nglwidget = engen.show_animated_traj()
nglwidget.clear_representations()
nglwidget.add_ball_and_stick()
nglwidget.center()
nglwidget

Step 2 - select different featurizations

Here we select ways to featurize the trajectory. Any PyEmma trajectory featurization can be used in this step and any of the parameters of the respective featurizations can be provided. Users can also use the default initialization which includes three sets of features: (1) amino-acid pairwise distances; (2) torsion angles and (3) amino-acid pairwise distances with the torsion angels.

[7]:

# Select the featurization for EnGens
# (this step is performed on the selection you loaded above)

#------------------------two options----------------------------#
#-----you can uncomment any option that you would like to try-----#

# Option 1 - extract the default features
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# in this case the three default featurizations  will be added to engen object:
# (1) aadist, (2) tangle and (3) aadist&tangle
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #


# remove any existing featurizers
engen.reset_featurizers()

# initialize default features
#engen.init_featurizers_default()


# Option 2 - define the features you want to add on your own
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# using any of the available featurizationswith PyEmma
# http://www.emma-project.org/latest/api/generated/pyemma.coordinates.featurizer.html
# with coresponding parameters
# in this example we add three featurizations to the engen object:
# (1) all-atoms, (2) C-alpha distances, (3) center-of-mass (COM) & tangles
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #


# remove any existing featurizers

engen.reset_featurizers()

#all-atoms
feat1 = {
    "add_sidechain_torsions": {}
}
#add the respective features to the engen structure
engen.add_featurizer(feat1)

#C-alpha distances
feat2 = {
    "add_backbone_torsions": {"cossin":True, "periodic":False}
}
#add the respective features to the engen structure
engen.add_featurizer(feat2)

#center of mass and torsion angles
feat3 = {
    "add_residue_COM": {"residue_indices": [0, 1,2]},
    "add_distances_ca": {"periodic":True, "excluded_neighbors":0}
}
#add the respective features to the engen structure
engen.add_featurizer(feat3)



#------------------------end of options----------------------------#

# print the desctiption of the default features
description = engen.describe_featurizers()
print(description)

Featurizer no. 0:
 sidechain_torsions
['CHI1 0 TRP 1', 'CHI1 0 LEU 2', 'CHI1 0 LEU 4', 'CHI1 0 LEU 5', 'CHI2 0 TRP 1', 'CHI2 0 LEU 2', 'CHI2 0 LEU 4', 'CHI2 0 LEU 5']...['CHI1 0 TRP 1', 'CHI1 0 LEU 2', 'CHI1 0 LEU 4', 'CHI1 0 LEU 5', 'CHI2 0 TRP 1', 'CHI2 0 LEU 2', 'CHI2 0 LEU 4', 'CHI2 0 LEU 5']
 Featurizer no. 1:
 backbone_torsions
['COS(PHI 0 LEU 2)', 'SIN(PHI 0 LEU 2)', 'COS(PSI 0 TRP 1)', 'SIN(PSI 0 TRP 1)', 'COS(PHI 0 ALA 3)', 'SIN(PHI 0 ALA 3)', 'COS(PSI 0 LEU 2)', 'SIN(PSI 0 LEU 2)', 'COS(PHI 0 LEU 4)', 'SIN(PHI 0 LEU 4)']...['COS(PSI 0 LEU 2)', 'SIN(PSI 0 LEU 2)', 'COS(PHI 0 LEU 4)', 'SIN(PHI 0 LEU 4)', 'COS(PSI 0 ALA 3)', 'SIN(PSI 0 ALA 3)', 'COS(PHI 0 LEU 5)', 'SIN(PHI 0 LEU 5)', 'COS(PSI 0 LEU 4)', 'SIN(PSI 0 LEU 4)']
 Featurizer no. 2:
 residue_COM&distances_ca
['TRP1 COM-x (all)', 'TRP1 COM-y (all)', 'TRP1 COM-z (all)', 'LEU2 COM-x (all)', 'LEU2 COM-y (all)', 'LEU2 COM-z (all)', 'ALA3 COM-x (all)', 'ALA3 COM-y (all)', 'ALA3 COM-z (all)', 'DIST: TRP 1 CA 3 - LEU 2 CA 27']...['DIST: TRP 1 CA 3 - LEU 2 CA 27', 'DIST: TRP 1 CA 3 - ALA 3 CA 46', 'DIST: TRP 1 CA 3 - LEU 4 CA 56', 'DIST: TRP 1 CA 3 - LEU 5 CA 75', 'DIST: LEU 2 CA 27 - ALA 3 CA 46', 'DIST: LEU 2 CA 27 - LEU 4 CA 56', 'DIST: LEU 2 CA 27 - LEU 5 CA 75', 'DIST: ALA 3 CA 46 - LEU 4 CA 56', 'DIST: ALA 3 CA 46 - LEU 5 CA 75', 'DIST: LEU 4 CA 56 - LEU 5 CA 75']

/home/docs/checkouts/readthedocs.org/user_builds/engens/conda/latest/lib/python3.6/site-packages/pyemma/coordinates/data/featurization/angles.py:211: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  indices = np.vstack(valid.values())

Step 3 - evaluate the featurizations

This step is optional - we recommend evaluating the featurizations and picking the best using PyEmma’s implementation of VAMP approach .

This helps you choose a set of features with which to proceed to the next Workflow.

[8]:

#provide a range of lags and dimensions to train on

#to make an informed decision on lag times see the timescale of your simulation
traj_times = mdtraj.load(engen.traj, top=engen.ref).time
#number of frames
traj_nframes = len(traj_times)
lags = [int(np.log(i)) for i in np.logspace(100, int(traj_nframes/50), num=3, base=np.e)]
dims = [i + 1 for i in range(1,3)]

#initialize VAMP2 featurizer and run it
featsel = fs.VAMP2FeatureSelection(lags, dims, engen)
featsel.run_vamp()

#plot VAMP2 results
featsel.plot_results()

Choosing features with VAMP might take some time...
Generating data from featurizations
Running VAMP with different parameters. Might take some time.
dimension =2, lag=100
dimension =2, lag=100
dimension =2, lag=100
dimension =3, lag=100
dimension =3, lag=100
dimension =3, lag=100

/home/docs/checkouts/readthedocs.org/user_builds/engens/conda/latest/lib/python3.6/site-packages/engens/core/FeatureSelector.py:125: UserWarning: FixedFormatter should only be used together with FixedLocator
  axes[i][j].set_xticklabels(names, rotation=30)

../_images/_notebooks_Workflow1-FeatureExtraction_11_2.png

Step 4 - pick the featurization

We suggest using the featurization which gives you the highest VAMP2 score from the analysis above. To do so, run the cell below.

[9]:

# Select using the VAMP2 selector from above

featsel.select_feature()


# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# alternatively you can chose any feature regardless of the VAMP2 score by running the code commented here

'''
#apply features
engen.apply_featurizations()
#print possible features
print(engen.describe_featurizers())
#select the number of the desired feature
feat_num = 1
# initialize selector
featsel = fs.UserFeatureSelection(feat_num, engen)
#select the feature
featsel.select_feature()
'''
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
print()

Choosing features with VAMP might take some time...
Using recycled VAMP2 scores.
Picked featurized no. 1: backbone_torsions
With maximum VAMP2 score no. 1
['COS(PHI 0 LEU 2)', 'SIN(PHI 0 LEU 2)', 'COS(PSI 0 TRP 1)', 'SIN(PSI 0 TRP 1)', 'COS(PHI 0 ALA 3)', 'SIN(PHI 0 ALA 3)', 'COS(PSI 0 LEU 2)', 'SIN(PSI 0 LEU 2)', 'COS(PHI 0 LEU 4)', 'SIN(PHI 0 LEU 4)', 'COS(PSI 0 ALA 3)', 'SIN(PSI 0 ALA 3)', 'COS(PHI 0 LEU 5)', 'SIN(PHI 0 LEU 5)', 'COS(PSI 0 LEU 4)', 'SIN(PSI 0 LEU 4)']

Step 5 - save the results as input for Workflow2 - dimensionality reduction

[10]:

# save the results for next workflow

with open("wf1_resulting_EnGen.pickle", "wb") as file:
    pickle.dump(engen, file, -1)