<!-- % ============================================================================== -->
<!-- % =========================== Clustering ======================================= -->
<!-- % ============================================================================== -->
## Clustering {#sec-sec_2_3_Clustering}
In this section, the second step, the clustering of all trajectories $F(\vec{\beta})$, is explained.
The main idea is to represent $F(\vec{\beta})$ through a movement between centroids.
The data and workflow of the clustering step are very similar to those of the previous data generation step and can be followed in figure @fig-fig_12_Clust.
All settings for this step can be individually configured in *settings.py*.
The $F(\vec{\beta})$ and the cluster-algorithm-specific parameters are filtered and provided to the clustering algorithm. The solutions are plotted and both the plots and the clustered output are saved.\newline
<!-- % ============================================== -->
<!-- % ========== Clustering Workflow =============== -->
<!-- % ============================================== -->
{#fig-fig_12_Clust}
Data clustering is an unsupervised machine learning technique.
There are a variety of approaches that may be used for this, e.g.,
k-means, affinity propagation, mean shift,
spectral clustering and Gaussian mixtures. All these
methods differ in their use cases, scalability,
metric or deployed norm and required input parameters. The latter
is an indicator of how much customization a method allows. Since k-means can handle very large
data sets and is easy and fast to implement, it is preferred here. Furthermore, Arthur and Vassilvitskii
[@Arthur2006] introduced k-means++, which is known
to outperform k-means. Therefore, \gls{cnmc} uses k-means++
as its default method for data clustering.
Note that applying k-means++ is not a novelty of \gls{cnmc}; it was already applied in the regular \gls{cnm} [@Fernex2021].\newline
In order to
cover the basics of k-means and k-means++, two terms
should be understood.
Picture a box with 30 points in it, where 10 are located on the left, 10
in the middle and 10 on the right side of the box. Given such a
constellation, it is natural to create 3 groups, one for
each overall position (left, center and right). Each group would
contain 10 points. These groups are called clusters and the
geometrical center of each cluster is called a centroid.
A similar thought experiment is visually depicted in [@Sergei_Visual].
Considering a dynamical system, the trajectory is retrieved by integrating the \gls{ode} numerically at discrete time steps.
For each time step, the obtained point is described by one x-, y- and z-coordinate.
Applying the above-mentioned idea to, e.g., the Lorenz system [@lorenz1963deterministic], defined as the set of equations in @eq-eq_6_Lorenz, yields the centroids shown in figure @fig-fig_13_Clust.
The full domains of the groups or clusters are color-coded in figure @fig-fig_14_Clust. A minimal code sketch of this procedure is given below the figures.\newline
::: {layout-ncol=2}
{#fig-fig_13_Clust}
{#fig-fig_14_Clust}
:::
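The following minimal sketch illustrates this procedure: the Lorenz system is integrated numerically and the resulting trajectory is clustered with *Scikit-learn's* k-means++. The chosen initial condition, time span and number of clusters are illustrative assumptions, not the values used by \gls{cnmc}.

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.cluster import KMeans

def lorenz(t, q, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system."""
    x, y, z = q
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# integrate the ODE at discrete time steps -> one (x, y, z) point per step
t_eval = np.linspace(0.0, 50.0, 10_000)
sol = solve_ivp(lorenz, (0.0, 50.0), [1.0, 1.0, 1.0], t_eval=t_eval)
trajectory = sol.y.T                      # shape (n_steps, 3)

# cluster the trajectory with k-means++ into 10 characteristic locations
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10)
labels = kmeans.fit_predict(trajectory)   # cluster index of every time step
centroids = kmeans.cluster_centers_       # centroid positions in phase space
```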
Theoretically,
the points which are taken to calculate a center could be assigned
weighting factors. However, this is not done in \gls{cnmc} and is therefore only
mentioned as a side note. Being familiar with the concept of
clusters and centroids, the actual workflow of k-means shall be explained.
For initializing
k-means, a number of clusters and an initial guess for the centroid
positions must be provided. Next, the distance between all the data
points and the centroids is calculated. Each data point is then assigned to the cluster for which
the corresponding centroid exhibits the smallest distance
to the considered data point.
Subsequently, for every cluster the geometrical mean of all data points assigned to it is determined. With the
new centroid positions, the clustering is
performed again.\newline
Calculating the means of the clustered
data points (the new centroids) and re-assigning the data points based on their
distances to these centroids
is done iteratively. The iterative process stops when
the difference between the prior and the current
centroid positions is zero or
falls below a given threshold. Further explanations with pseudo-code and
visualizations for better understanding can be found in [@Frochte2020]
and [@Sergei_Visual], respectively.\newline
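A plain *NumPy* sketch of this loop, assuming the data points and the initial centroids are given as arrays; it is meant to mirror the description above, not the implementation in \gls{cnmc}.

```python
import numpy as np

def lloyd(points, centroids, tol=1e-6, max_iter=300):
    """Iterate assignment and centroid update until the centroids barely move."""
    for _ in range(max_iter):
        # distance of every point to every centroid, shape (n_points, k)
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)  # index of the closest centroid per point
        # new centroid = geometrical mean of the points assigned to the cluster
        # (empty clusters are not handled in this simple sketch)
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(len(centroids))]
        )
        if np.linalg.norm(new_centroids - centroids) < tol:  # convergence check
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```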
<!-- % ------------------------------------------------------------------------------ -->
<!-- % --------------------- PART 2 ------------------------------------------------- -->
<!-- % ------------------------------------------------------------------------------ -->
Mathematically, the k-means objective can be expressed
as an optimization problem with the centroid
position $\boldsymbol{\mu}_j$ as the design variable. It is given in equation
@eq-eq_1_k_means (extracted from [@Frochte2020]), where
$\boldsymbol{\mu}_j$ and $\mathrm{D}^{\prime}_j$ denote the centroid or
mean of the *j*th cluster and the data points
belonging to the *j*th cluster, respectively.
The distance between all the *j*th cluster data points
and their corresponding *j*th centroid is
stated as $\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)$.
$$
\begin{equation}
\label{eq_1_k_means}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{\mu}_j)
\end{equation}
$$ {#eq-eq_1_k_means}
Usually, the k-means algorithm is deployed with a Euclidean metric,
and equation @eq-eq_1_k_means becomes @eq-eq_2_k_Means_Ly, which
is known as the Lloyd algorithm [@Frochte2020; @Lloyd1982]. The
Lloyd algorithm can be understood as a minimization of the within-cluster variance.
Note, however, that k-means is only equivalent to minimizing
the variance when the Euclidean norm is used.
$$
\begin{equation}
\label{eq_2_k_Means_Ly}
\underset{\boldsymbol{\mu}_j}{\mathrm{argmin}}\sum_{j=1}^k \; \sum_{\boldsymbol{x}_j \in \mathrm{D}^{\prime}_j }
\| \boldsymbol{x}_j - \boldsymbol{\mu}_j \|^2
\end{equation}
$$ {#eq-eq_2_k_Means_Ly}
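For a given assignment, the objective of equation @eq-eq_2_k_Means_Ly can be evaluated directly; the following sketch assumes the points, centroids and labels from the sketch above. *Scikit-learn* reports this quantity as `inertia_`.

```python
import numpy as np

def kmeans_objective(points, centroids, labels):
    """Sum of squared Euclidean distances between every point and the
    centroid of the cluster it is assigned to (cf. @eq-eq_2_k_Means_Ly)."""
    return float(((points - centroids[labels]) ** 2).sum())
```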
The outcome of the clustering algorithm depends strongly on the provided
initial centroid positions. Since in most cases these
are guessed, there is no guarantee of a reliable outcome.
Sergei Vassilvitskii, one of the inventors of
k-means++, says in one of his presentations [@Sergei_Black_Art]
that finding a good set of initial points is a black art.
Arthur and Vassilvitskii [@Arthur2006] state
that the speed and simplicity of k-means are appealing, not
its accuracy. There are many natural examples for which
the algorithm generates arbitrarily bad clusterings [@Arthur2006].\newline
An alternative or improved version of k-means is the already
mentioned k-means++, which
differs only in the initialization step. Instead of providing
initial positions for all centroids, just one centroid's
position is supplied. The remaining centroids are chosen one after another based on
the distances to the already selected centroids. Concretely, for every
data point the squared distance to its nearest existing centroid is computed, and the next
centroid is drawn with a probability proportional to this squared distance, so that points
far away from all existing centroids are favored. This is repeated until all $k$
centroids are generated. A visual depiction of this process
is given by Sergei Vassilvitskii in [@Sergei_Visual].
Since the outcome of k-means++ is more reliable than that of
k-means, k-means++ is deployed in \gls{cnmc}.\newline
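A compact sketch of this seeding scheme, assuming the trajectory points are given as a *NumPy* array; it is only meant to illustrate the distance-weighted selection, not the *Scikit-learn* implementation used by \gls{cnmc}.

```python
import numpy as np

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding: pick the first centroid at random, then draw each
    further centroid with probability proportional to the squared distance
    to the nearest centroid chosen so far."""
    if rng is None:
        rng = np.random.default_rng(0)
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing centroid
        d2 = ((points[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)
```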
Having discussed some basics of k-means++, it shall now be
elaborated how and why the solution of a dynamical system is
clustered. The solution of a dynamical system is a trajectory.
If the trajectory repeats itself or happens to come close
to prior trajectories without actually touching them,
characteristic sections can be found.
Each characteristic section in the phase space is
captured by a centroid. The movement from one
centroid to another is supposed to portray the original
trajectory. With a clustering algorithm, these representative
characteristic locations in the phase space are obtained.
Since the clusters shall capture an entire trajectory, it is
evident that the number of clusters is an
essential parameter to choose. The latter fact becomes even
clearer when recalling that a trajectory can be multi-modal or complex.\newline
In the case of a highly non-linear
trajectory, it is obvious that many clusters are needed in
order to minimize the loss of information about the original
trajectory. The projection of the real trajectory
onto a cluster-based movement can be compared to
a reduced-order model of the trajectory. In this context,
it is plausible to refer to the centroids as
representative characteristic locations. Furthermore, \gls{cnm}, and thus \gls{cnmc}, exploits graph theory.
Therefore, the centroids can also be denoted as nodes or characteristic nodes.\newline
The remaining part of this section is devoted exclusively to the application within \gls{cnmc}. The leveraged k-means++ algorithm comes from the machine learning *Python* library *Scikit-learn* [@scikit-learn].
Crucial settings, e.g., the number of clusters $K$, the maximal number of iterations, the tolerance as convergence criterion and the number of different centroid seeds with which k-means is executed, can be adjusted in *settings.py*.
The operator can decide whether the clustering step shall be performed or skipped.
The path for outputs can be modified and generating plots is also optional.
For the clustering stage, there are four types of plots available.
Two types of plots are depicted in figures @fig-fig_13_Clust and @fig-fig_14_Clust.
With the generated HTML plots, the same features as described in section [-@sec-sec_2_2_Data_Gen] are available, e.g., receiving more information through pop-up panels and
switching between a dark and a white mode.
\newline
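The following hedged sketch shows how the settings listed above could be passed to *Scikit-learn's* k-means++; the variable names mirror the description but are assumptions, not the actual keys of *settings.py*.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative values; in CNMc these would be read from settings.py
K, n_init, max_iter, tol = 10, 100, 300, 1e-4
trajectory = np.random.default_rng(0).normal(size=(1000, 3))  # placeholder for a real trajectory

kmeans = KMeans(
    n_clusters=K,        # number of clusters K
    init="k-means++",    # k-means++ seeding
    n_init=n_init,       # number of different centroid seeds
    max_iter=max_iter,   # maximal number of iterations
    tol=tol,             # tolerance as convergence criterion
)
labels = kmeans.fit_predict(trajectory)   # cluster index per time step
centroids = kmeans.cluster_centers_
```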
The other two types of charts are not displayed here because they are intended to be studied as HTML graphs, where the output can be viewed from multiple angles.
The first type shows the clustered output of one system for two different $\beta$ next to each other.
The centroids are labeled randomly, as will be shown in subsection [-@sec-subsec_2_2_1_Parameter_Study].
Consequently, centroids that are immediate neighbors across the two separate $\beta$ generally carry different labels.
In the second remaining type of HTML graph, the closest centroids across the two different $\beta$ are connected by lines.
Also, in the same way as for the data generation step, an additional HTML file containing the charts for all $\vec{\beta}$ is generated.
\newline
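A hedged sketch of how such a chart could be assembled, assuming *Plotly* as the HTML plotting backend; the file name and the nearest-neighbor pairing are illustrative assumptions.

```python
import numpy as np
import plotly.graph_objects as go

def plot_closest_centroids(centroids_a, centroids_b, path="closest_centroids.html"):
    """Plot the centroids of two different beta values and connect every
    centroid of the first set to its closest counterpart in the second set."""
    fig = go.Figure()
    fig.add_trace(go.Scatter3d(x=centroids_a[:, 0], y=centroids_a[:, 1],
                               z=centroids_a[:, 2], mode="markers", name="beta_1"))
    fig.add_trace(go.Scatter3d(x=centroids_b[:, 0], y=centroids_b[:, 1],
                               z=centroids_b[:, 2], mode="markers", name="beta_2"))
    for c_a in centroids_a:
        # closest centroid of the second beta for the current centroid
        c_b = centroids_b[np.argmin(((centroids_b - c_a) ** 2).sum(axis=1))]
        fig.add_trace(go.Scatter3d(x=[c_a[0], c_b[0]], y=[c_a[1], c_b[1]],
                                   z=[c_a[2], c_b[2]], mode="lines", showlegend=False))
    fig.write_html(path)
```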
It can be concluded that the clustering step is performed by employing *Scikit-learn's* k-means++ implementation, which is well suited for a large number of points. As usual, all important settings can be controlled with *settings.py*.
### Parameter Study {#sec-subsec_2_2_1_Parameter_Study}
In this subsection, the effect of the parameter *n\_init* on the clustering step shall be described. After that, the random labeling of the centroids is highlighted.
With the parameter *n\_init* it is possible to define how often the k-means algorithm is executed with different centroid seeds [@scikit-learn].
For a reliable clustering quality, *n\_init* should be chosen high. However, the drawback is that the calculation time grows with increasing *n\_init*. If *n\_init* is chosen too high, the clustering part becomes the new bottleneck of the entire \gls{cnmc} pipeline.\newline
To conduct the parameter study, clustering was performed using the following *n\_init* values:
$\textit{n\_init} = \{100,\, 200, \, 400,\, 800,\, 1000, \, 1200, \, 1500 \}$.
Some results are presented in figures @fig-fig_15 to @fig-fig_20.
It can be observed that, comparing all the different *n\_init* values, visually no big difference in the placement of the centroids can be perceived.
A graphical examination is sufficient because even with *n\_init* values that differ by only one ($n_{init,1} - n_{init,2} = 1$), the centroid positions are still expected to vary slightly.
Simply put, only deviations on a macroscopic scale, which can be noticed visually, are searched for. In conclusion, $\textit{n\_init} = 100$ and $\textit{n\_init} = 1500$ can be considered to deliver equally valuable clustering outcomes.
Hence, the computational expense can be reduced by deciding on a reasonably small value for *n\_init*.\newline
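A minimal sketch of such a study, assuming a placeholder trajectory; printing the objective value (`inertia_`) complements the visual comparison described above and is an illustrative addition, not part of the original study.

```python
import numpy as np
from sklearn.cluster import KMeans

trajectory = np.random.default_rng(0).normal(size=(1000, 3))  # placeholder for a real trajectory

for n_init in [100, 200, 400, 800, 1000, 1200, 1500]:
    kmeans = KMeans(n_clusters=10, init="k-means++", n_init=n_init).fit(trajectory)
    # objective value of the best out of n_init centroid seeds
    print(n_init, kmeans.inertia_)
```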
The second aim of this subsection was to highlight the fact that the centroids are labeled randomly. For this purpose, the depicted figures @fig-fig_15 to @fig-fig_20 shall be examined. Concretely, any of the referenced figures can be compared with the remaining ones to verify that the labeling does not obey any evident rule.
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->
::: {layout-ncol=2}
{#fig-fig_15}
{#fig-fig_16}
:::
::: {layout-ncol=2}
{#fig-fig_17}
{#fig-fig_18}
:::
::: {layout-ncol=2}
{#fig-fig_19}
{#fig-fig_20}
:::
<!-- % ============================================================================== -->
<!-- % ============================ PLTS ============================================ -->
<!-- % ============================================================================== -->