<!-- % ============================================================================== -->
<!-- % =========================== Clustering ======================================= -->
<!-- % ============================================================================== -->
## Clustering {#sec-sec_2_3_Clustering}
In this section, the second step, the clustering of all trajectories $F(\vec{\beta})$, is explained.
The main idea is to represent $F(\vec{\beta})$ through movement on centroids.
The data and workflow of the clustering step are very similar to those of the previous data generation step and are summarized in figure @fig-fig_12_Clust .
All settings for this step can be configured individually in *settings.py*.
The $F(\vec{\beta})$ and the cluster-algorithm-specific parameters are filtered and passed to the clustering algorithm. The solutions are plotted, and both the plots and the clustered output are saved.\newline
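Purely as an illustration, the clustering-related options could be collected in a block like the following sketch; the key names are hypothetical and are not claimed to match the actual *settings.py* of \gls{cnmc}:

```python
# Illustrative sketch only: the option names are hypothetical placeholders and
# do not claim to match the actual settings.py of CNMc.
CLUSTERING_SETTINGS = {
    "perform_clustering": True,   # run or skip this step
    "n_clusters": 10,             # number of centroids K
    "n_init": 100,                # k-means restarts with different centroid seeds
    "max_iter": 300,              # iteration limit per run
    "tol": 1e-4,                  # convergence tolerance
    "make_plots": True,           # write HTML plots
    "output_path": "out/clustering",
}
```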
<!-- % ============================================== -->
<!-- % ========== Clustering Workflow =============== -->
<!-- % ============================================== -->
{#fig-fig_12_Clust}
<!--%-->
<!--%-->
Data clustering is an unsupervised machine learning technique.
A variety of approaches may be used for this, e.g.,
k-means, affinity propagation, mean shift,
spectral clustering and Gaussian mixtures. The
methods differ in their use cases, scalability,
metric or deployed norm, and required input parameters. The latter
is an indicator of customization ability. Since k-means can handle very large
data sets and is easy and fast to implement, it is the preferred choice. Furthermore, Arthur et al.
[@Arthur2006] introduced k-means++, which is known
to outperform k-means. Therefore, \gls{cnmc} uses k-means++
as its default method for data clustering.
Note that applying k-means++ is not new in \gls{cnmc}; it was already used in the regular \gls{cnm} [@Fernex2021].\newline
In order to
cover the basics of k-means and k-means++, two terms
should be understood.
Picture a box with 30 points in it, where 10 are located on the left, 10
in the middle and 10 on the right side of the box. Given such a
constellation, it is natural to create 3 groups, one for
each overall position (left, center and right). Each group would
contain 10 points. These groups are called clusters and the
geometrical center of each cluster is called a centroid.
A similar thought experiment is visually depicted in [@Sergei_Visual].
Considering a dynamical system, the trajectory is retrieved by integrating the \gls{ode} numerically at discrete time steps.
For each time step the obtained point is described by one x-, y- and z-coordinate.
Applying the above-mentioned idea to, e.g., the Lorenz system [@lorenz1963deterministic], defined by the set of equations in @eq-eq_6_Lorenz, yields the centroids shown in figure @fig-fig_13_Clust .
The full domains of the groups or clusters are color-coded in figure @fig-fig_14_Clust .\newline
::: {layout-ncol=2}
{#fig-fig_13_Clust}
{#fig-fig_14_Clust}
:::
Theoretically,
the points which are taken to calculate a center could be assigned
weighting factors. However, this is not done in \gls{cnmc} and is therefore only
mentioned as a side note. Being familiar with the concept of
clusters and centroids, the actual workflow of k-means shall be explained.
For initializing
k-means, the number of clusters and an initial guess for the centroid
positions must be provided. Next, the distance between all the data
points and the centroids is calculated. Each data point is assigned to the cluster whose
centroid exhibits the smallest distance
to the considered data point.
Subsequently, the geometrical mean of the data points belonging to each cluster is determined and taken as the new centroid position. With the
new centroid positions, the assignment is
performed again. \newline
Updating the centroids as the mean of the clustered
data points and re-assigning the data points based on their
distances to the centroids
is done iteratively. The iterative process stops when
the difference between the prior and current
centroid positions is zero or
satisfies a given threshold. Further explanations with pseudo-code and
visualizations can be found in [@Frochte2020]
and [@Sergei_Visual], respectively.\newline
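To make the assignment and update steps concrete, the following minimal Python sketch (not taken from \gls{cnmc}) integrates the Lorenz system with *SciPy* and applies a plain Lloyd iteration with a naive random initialization; the parameter values and the number of clusters are chosen arbitrarily for illustration:

```python
# Minimal sketch (not the CNMc implementation): integrate the Lorenz system and
# run a plain Lloyd/k-means iteration on the resulting trajectory points.
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, q, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = q
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# Trajectory: one (x, y, z) point per discrete time step
t_eval = np.linspace(0.0, 50.0, 10_000)
sol = solve_ivp(lorenz, (0.0, 50.0), [1.0, 1.0, 1.0], t_eval=t_eval)
X = sol.y.T                                  # shape (n_points, 3)

K = 10
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), K, replace=False)]   # naive initial guess

for _ in range(100):
    # assignment step: each point goes to its closest centroid
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # update step: new centroid = mean of the assigned points
    # (possible empty clusters are ignored here for brevity)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_centroids, centroids):          # converged
        break
    centroids = new_centroids

# sum of squared distances to the closest centroid (the Lloyd objective)
print(((X - centroids[labels]) ** 2).sum())
```

The printed value is the quantity that equation @eq-eq_2_k_Means_Ly below minimizes.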
<!-- % ------------------------------------------------------------------------------ -->
<!-- % --------------------- PART 2 ------------------------------------------------- -->
<!-- % ------------------------------------------------------------------------------ -->
Mathematically, the k-means objective can be expressed
as an optimization problem with the centroid
positions $\boldsymbol{\mu}_j$ as the design variables. It is given in equation
@eq-eq_1_k_means (extracted from [@Frochte2020]), where
$\boldsymbol{\mu}_j$ and $\mathrm{D}^{\prime}_j$ denote the centroid or
mean of the *j*th cluster and the data points
belonging to the *j*th cluster, respectively.
The distance between the *j*th cluster's data points
and their corresponding *j*th centroid is
denoted as $\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)$.
$$
\begin{equation}
\label{eq_1_k_means}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)
\end{equation}
$$ {#eq-eq_1_k_means}
Usually, the k-means algorithm is deployed with a Euclidean metric,
and equation @eq-eq_1_k_means becomes equation @eq-eq_2_k_Means_Ly, which
is known as the Lloyd algorithm [@Frochte2020; @Lloyd1982]. The
Lloyd algorithm can be understood as a minimization of the variance.
Note, however, that k-means is not generally equivalent to reducing
the variance; this holds only when the Euclidean norm is used.
$$
\begin{equation}
\label{eq_2_k_Means_Ly}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\| \boldsymbol{x}_j - \boldsymbol{\mu}_j \|^2
\end{equation}
$$ {#eq-eq_2_k_Means_Ly}
The clustering outcome depends strongly on the provided
initial centroid positions. Since in most cases these
are guessed, there is no guarantee of a reliable outcome.
Sergei Vassilvitskii, one of the inventors of
k-means++, says in one of his presentations [@Sergei_Black_Art] that
finding a good set of initial points is a black art.
Arthur et al. [@Arthur2006] state
that the speed and simplicity of k-means are appealing, but not
its accuracy. There are many natural examples for which
the algorithm generates arbitrarily bad clusterings [@Arthur2006].\newline
An alternative or improved version of k-means is the already
mentioned k-means++, which
only differs in the initialization step. Instead of providing
initial positions for all centroids, just one centroid's
position is supplied. The remaining centroids are chosen based on
the distances to the already selected centroids. Concretely, the distance between all
data points and the existing centroids is computed, and a data point
far away from the existing centroids is preferred as the next
centroid; in k-means++ the selection probability is proportional
to the squared distance [@Arthur2006]. This is repeated until all $k$
centroids are generated. A visual depiction of this process
is given by Sergei Vassilvitskii in [@Sergei_Visual].
Since the outcome of k-means++ is more reliable than that of
k-means, k-means++ is deployed in \gls{cnmc}.\newline
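The seeding procedure described above can be sketched in a few lines. The following is an illustration of the distance-weighted sampling only, not the *Scikit-learn* implementation that \gls{cnmc} actually uses; the data matrix `X` is assumed to hold the trajectory points as in the earlier sketch:

```python
# Minimal sketch of k-means++ seeding (D^2 sampling), not the Scikit-learn code:
# pick the first centroid at random, then sample further centroids with a
# probability proportional to the squared distance to the nearest chosen centroid.
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centroids = [X[rng.integers(len(X))]]            # first centroid: random data point
    for _ in range(k - 1):
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )                                            # squared distance to nearest centroid
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```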
After having discussed some basics of k-means++, it shall be
elaborated how and why the solution of the dynamical system is
clustered. The solution of any dynamical system is a trajectory.
If the trajectory repeats itself or happens to come close
to prior trajectories without actually touching them,
characteristic sections can be found.
Each characteristic section in the phase space is
captured by a centroid. The movement from one
centroid to another is supposed to portray the original
trajectory. With a clustering algorithm, these representative
characteristic locations in the phase space are obtained.
Since the clusters shall capture an entire trajectory, it is
evident that the number of clusters is an
essential parameter to choose. This becomes even
more apparent when recalling that a trajectory can be multi-modal or complex.\newline
In the case of a highly non-linear
trajectory, it is obvious that many clusters are required in
order to minimize the loss of information about the original
trajectory. The projection of the real trajectory
onto a cluster-based movement can be compared to
a reduced-order model of the trajectory. In this context,
it is plausible to refer to the centroids as
representative characteristic locations. Furthermore, \gls{cnm} and thus \gls{cnmc} exploit graph theory.
Therefore, the centroids can also be denoted as nodes or characteristic nodes.\newline
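How strongly the number of clusters influences the quality of this reduced-order representation can be probed directly; the following sketch (illustration only, not part of \gls{cnmc}) compares the k-means objective for a few cluster counts, reusing the trajectory matrix `X` from the sketch above:

```python
# Illustration only: how the k-means objective (Scikit-learn's `inertia_`)
# decreases with the number of clusters K for the Lorenz trajectory X.
# A K chosen too small loses much of the original trajectory.
from sklearn.cluster import KMeans

for K in (5, 10, 20, 50):
    km = KMeans(n_clusters=K, init="k-means++", n_init=10).fit(X)
    print(K, km.inertia_)   # sum of squared distances to the closest centroid
```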
The remaining part of this section is devoted exclusively to the application of \gls{cnmc}. The leveraged k-means++ algorithm is taken from the machine learning *Python* library *Scikit-learn* [@scikit-learn].
Crucial settings can be adjusted, e.g., the number of clusters $K$, the maximal number of iterations, the tolerance as a convergence criterion and the number of different centroid seeds with which k-means is executed.
The operator can decide whether the clustering step shall be performed or skipped.
The path for outputs can be modified and generating plots is also optional.
For the clustering stage, there are 4 types of plots available.
Two types of plots are depicted in figures @fig-fig_13_Clust and @fig-fig_14_Clust .
With the generated HTML plots the same features as described in section [-@sec-sec_2_2_Data_Gen] are available, e.g., receiving more information through pop-up panels and
switching between a dark and a white mode.
\newline
The other two types of charts are not displayed here because they are intended to be studied as HTML graphs, where the output can be viewed from multiple angles.
The first type shows the clustered output of one system for two different $\beta$ next to each other.
The centroids are labeled randomly, as will be shown in subsection [-@sec-subsec_2_2_1_Parameter_Study].
Consequently, centroids that are immediate neighbors across the two separate $\beta$ carry different labels.
In the second remaining type of HTML graph, the closest centroids across the two different $\beta$ are connected through lines.
Also, in the same way as it was done for the first step, the data generation, an additional HTML file containing the charts for all $\vec{\beta}$ is generated.
\newline
It can be concluded that the clustering step is performed by employing *Scikit-learn's* k-means++ implementation, which is well suited for a large number of points. As usual, all important settings can be controlled with *settings.py*.
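The settings named above map directly onto the arguments of *Scikit-learn's* `KMeans` class. The following is a minimal usage sketch, not the actual \gls{cnmc} source code:

```python
# Minimal usage sketch of Scikit-learn's k-means++ (not the CNMc source code).
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=10,       # number of clusters K
    init="k-means++",    # k-means++ seeding
    n_init=100,          # number of different centroid seeds
    max_iter=300,        # maximal number of iterations
    tol=1e-4,            # convergence tolerance
)
labels = kmeans.fit_predict(X)         # X: trajectory points, shape (n_points, 3)
centroids = kmeans.cluster_centers_    # representative characteristic locations
```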
### Parameter Study {#sec-subsec_2_2_1_Parameter_Study}
In this subsection, the effect of the parameter *n\_init* on the clustering step shall be examined. After that, the random labeling of the centroids is highlighted.
With the parameter *n\_init* it is possible to define how often the k-means algorithm is executed with different centroid seeds [@scikit-learn].
For a reliable clustering quality, *n\_init* should be chosen high. However, the drawback is that the calculation time grows with increasing *n\_init*. If *n\_init* is chosen too high, the clustering part becomes the new bottleneck of the entire \gls{cnmc} pipeline. \newline
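This trade-off can be probed directly with *Scikit-learn*; the following sketch (illustration only, the values and the data matrix `X` are placeholders) times k-means for a few *n\_init* values:

```python
# Illustration only (not part of CNMc): measure how the run time and the final
# objective change with n_init for a fixed trajectory matrix X.
import time
from sklearn.cluster import KMeans

for n_init in (100, 200, 400, 800):
    start = time.perf_counter()
    km = KMeans(n_clusters=10, init="k-means++", n_init=n_init).fit(X)
    print(n_init, round(time.perf_counter() - start, 2), km.inertia_)
```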
To conduct the parameter study, clustering was performed using the following *n\_init* values:
$\textit{n\_init} = \{100,\, 200, \, 400,\, 800,\, 1000, \, 1200, \, 1500 \}$.
Some results are presented in figures @fig-fig_15 to @fig-fig_20 .
It can be observed that, when all the different *n\_init* values are compared, visually no big difference in the placement of the centroids can be perceived.
A graphical examination is sufficient because even with *n\_init* values that differ by only one ($n_{init,1} - n_{init,2} = 1$), the centroid positions are still expected to vary slightly.
Simply put, only deviations on a macroscopic scale, which can be noted visually, are searched for. In conclusion, it can be said that $\textit{n\_init} = 100$ and $\textit{n\_init} = 1500$ deliver equally valuable clustering outcomes.
Hence, the computational expense can be reduced by deciding on a reasonably small value for *n\_init*.\newline
The second aim of this subsection was to highlight the fact that the centroids are labeled randomly. For this purpose, the depicted figures @fig-fig_15 to @fig-fig_20 shall be examined. Concretely, any of the referred figures can be compared with the remaining figures to be convinced that the labeling does not obey any evident rule.
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->
::: {layout-ncol=2}
{#fig-fig_15}
{#fig-fig_16}
:::
::: {layout-ncol=2}
{#fig-fig_17}
{#fig-fig_18}
:::
::: {layout-ncol=2}
{#fig-fig_19}
{#fig-fig_20}
:::
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->