Workflow 3 - cluster the trajectory

Input: EnGen object featurized with reduced dimensionality (generated by Workflow2)

Output: Representative trajectory ensemble

Steps:

Import the featurized trajectory from Workflow2
Choose clustering technique: KMeans, Gaussian Mixture Models
Choose appropriate parameters for clustering (number of clusters)
Additionally filter trajectories
Extract the ensemble

[1]:

#required imports

from engens.core.EnGens import EnGen
from engens.core.ClustEn import *
import pickle as pk

Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.

Step 0 - Load data from Workflow2

[2]:

engen = None
with open("wf2_resulting_EnGen.pickle", "rb") as file:
    engen = pk.load(file)

[3]:

traj = engen.traj
ref = engen.ref
print("Using the trajectory {} and reference pdb file {}".format(traj, ref))

topology = engen.mdtrajref
print("The topology is:")
print(topology)

feat_dims = engen.dimred_data.shape[0]
print("The dimensionality of your featurization is {}".format(feat_dims))

feat = engen.featurizers[engen.chosen_feat_index]
print("You chose to featurize with")
print(feat.describe())

dimred_data = engen.dimred_data
print("After dimensionality reduction the dimension of your features is {}".format(dimred_data.shape[1]))

Using the trajectory pentapeptide-00-500ns-impl-solv.xtc and reference pdb file pentapeptide-impl-solv.pdb
The topology is:
<mdtraj.Topology with 1 chains, 5 residues, 94 atoms, 95 bonds>
The dimensionality of your featurization is 5001
You chose to featurize with
['COS(PHI 0 LEU 2)', 'SIN(PHI 0 LEU 2)', 'COS(PSI 0 TRP 1)', 'SIN(PSI 0 TRP 1)', 'COS(PHI 0 ALA 3)', 'SIN(PHI 0 ALA 3)', 'COS(PSI 0 LEU 2)', 'SIN(PSI 0 LEU 2)', 'COS(PHI 0 LEU 4)', 'SIN(PHI 0 LEU 4)', 'COS(PSI 0 ALA 3)', 'SIN(PSI 0 ALA 3)', 'COS(PHI 0 LEU 5)', 'SIN(PHI 0 LEU 5)', 'COS(PSI 0 LEU 4)', 'SIN(PSI 0 LEU 4)']
After dimensionality reduction the dimension of your features is 7

Step 1 - choose the clustering method

[4]:

#------------two clustering algorithms------------------#
#----------------choose and uncomment one----------------#
# Option 1 - choose Kmeans
'''
clustering = "KM"
cluster_method = clusterings[clustering](engen, n_rep=2)

# Option 2 - choose GMMs
'''
clustering = "GMM"
cluster_method = clusterings[clustering](engen, n_rep=2)

Step 2 - run the clustering with different parameters

[5]:

# Create K clusters
if clustering =="KM":
    params = [{"n_clusters":i} for i in range(2, 10)]
    cluster_method.cluster_multiple_params(params)
else:
    params = [{"n_components":i} for i in range(2, 10)]
    cluster_method.cluster_multiple_params(params)

# analyze these parameters with the elbow method
cluster_method.analyze_elbow_method()

Clustering with params={'n_components': 2}
Clustering with params={'n_components': 3}
Clustering with params={'n_components': 4}
Clustering with params={'n_components': 5}
Clustering with params={'n_components': 6}
Clustering with params={'n_components': 7}
Clustering with params={'n_components': 8}
Clustering with params={'n_components': 9}

../_images/_notebooks_Workflow3-Clustering_8_1.png

Optimal params={'n_components': 3}

[6]:

# analyze these parameters with the silhouette method
cluster_method.analyze_silhouette()

Best parameters from silhouette analysis are {'n_components': 2}

[6]:

{'n_components': 2}

../_images/_notebooks_Workflow3-Clustering_9_2.png

../_images/_notebooks_Workflow3-Clustering_9_3.png

../_images/_notebooks_Workflow3-Clustering_9_4.png

../_images/_notebooks_Workflow3-Clustering_9_5.png

../_images/_notebooks_Workflow3-Clustering_9_6.png

../_images/_notebooks_Workflow3-Clustering_9_7.png

../_images/_notebooks_Workflow3-Clustering_9_8.png

../_images/_notebooks_Workflow3-Clustering_9_9.png

[7]:

# pick the number of clusters
n = 3
cluster_method.choose_n(n)

Step 3 - optionally pick a subset of clusters with the heighest weight

[8]:

cluster_method.plot_cluster_weight()

../_images/_notebooks_Workflow3-Clustering_12_0.png

[9]:

cluster_method.choose_clusters(thr=0.007)

Chosen cluster ids: [0, 1, 2]

../_images/_notebooks_Workflow3-Clustering_13_1.png

Step 4 - extract conformations for the ensemble

[10]:

# mode can be
# 1 - "center" and the representatives will be the points closest to the center of the clusters
# 2 - "hub" and the representatives will be the points with the most neighbors

mode = "center"
cluster_method.choose_conformations(mode=mode)

Choosing points closest to the cluster center
0
1
2

../_images/_notebooks_Workflow3-Clustering_15_1.png

[11]:

cluster_method.chosen_frames

[11]:

array([ 983, 3838, 4030])

[12]:

ensemble_location = "./res_ensemble"
cluster_method.extract_conformations(ensemble_location)

Closest conformation inside cluster 0 frame 983 of the striped trajectory
Extracting and saving to ./res_ensemble/conf_in_cluster_0.pdb
Closest conformation inside cluster 1 frame 3838 of the striped trajectory
Extracting and saving to ./res_ensemble/conf_in_cluster_1.pdb
Closest conformation inside cluster 2 frame 4030 of the striped trajectory
Extracting and saving to ./res_ensemble/conf_in_cluster_2.pdb

Step 5 - save results for analysis

[13]:

with open("wf3_resulting_EnGen.pickle", "wb") as file:
    pk.dump(engen, file, -1)

with open("wf3_resulting_Clust.pickle", "wb") as file:
    pk.dump(cluster_method, file, -1)