Based on the findings in \autoref{sec:solution}, an implementation in Python was realized.
The following sections describe the code structure and the service composition used to fulfill the requirements.

\section{Code structure}

There are four packages forming the Analysis Framework project:
\begin{itemize}
\item analysis: Core analysis functionality, including log parsing, analysis, postprocessing, and rendering
\item clients: Connection classes for game servers, used to retrieve log files and game configurations
\item selector: Web interface for non-expert users
\item tasks: Definitions of asynchronous tasks
\end{itemize}

The analysis and clients packages are described in \autoref{sec:analysisframework}, while \autoref{sec:web} covers the selector and tasks packages.

\image{.7\textwidth}{packages}{Project package overview}{img:packages}

\subsection{Analysis Framework}\label{sec:analysisframework}

The internal structure of the analysis package is shown in \autoref{img:pack-analysis}.
Besides the sub-packages for the analysis work (analyzers: \autoref{sec:analysiswork}) and log parsing (loaders: \autoref{sec:loaders}), it contains helper functionality and the Python module \texttt{log\_analyzer}, which serves as an entry point for researchers to experiment with and as an outline of the intended workflow.

\image{.7\textwidth}{packages-analysis}{analysis package overview}{img:pack-analysis}

\subsubsection{Log parsing}\label{sec:loaders}

As outlined in \autoref{img:pack-loader}, this package parses log files into an internal structure.

\image{.7\textwidth}{packages-loader}{loader package overview}{img:pack-loader}

\paragraph{The loader module} holds the definition of the abstract base class \texttt{Loader}.
It has two unimplemented methods: \texttt{load} and \texttt{get\_entry}.
While the first is called with a filename as argument to load a log file, the second is then called repeatedly to retrieve a single log entry for the analysis steps.
Processing stops when all log entries have been returned by this method.

The module also defines a showcase implementation that loads a JSON file and \texttt{yield}s its items.

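The actual loaders are not reproduced here; the following listing is a minimal sketch of the interface just described, with the showcase JSON loader as implementation.
The attribute name \texttt{\_entries} and the constructor-free design are illustrative assumptions, not the actual module code.

\begin{lstlisting}[language=python,caption={Sketch of the \texttt{Loader} interface with a JSON showcase implementation (illustrative)},label=code:loader-sketch]
import json
from abc import ABC, abstractmethod


class Loader(ABC):
    """Parse a log file and hand out its entries one by one."""

    @abstractmethod
    def load(self, filename):
        """Open and parse the log file given by ``filename``."""

    @abstractmethod
    def get_entry(self):
        """Yield one parsed log entry per iteration until exhausted."""


class JSONLoader(Loader):
    """Showcase implementation: load a JSON file and yield its items."""

    def load(self, filename):
        with open(filename) as handle:
            self._entries = json.load(handle)

    def get_entry(self):
        yield from self._entries
\end{lstlisting}
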
\paragraph{Biogames} handles the log files of Biodiv2go, for which a composite approach was used: the games' log files come as ZIP archives containing an SQLite database and possibly media files.
The \texttt{SQLiteLoader} contains the logic to handle a plain SQLite file according to the \texttt{Loader} definition from above.
By extending this class, \texttt{ZipSQLiteLoader} focuses on unzipping the archive and creating a temporary storage location, leaving the interpretation of the data to its super class.
This avoids code duplication and, with a small amount of tweaking, would present a generic way to handle SQLite database files.

\paragraph{Neocart(ographer)} is the geogame used in the evaluation step described in \autoref{sec:eval}.
This \texttt{Loader} deals with severely malformed XML files.

\paragraph{Module settings} are stored in the \texttt{\_\_init\_\_} module.
This is mainly a mapping to allow references to \texttt{Loader}s in the JSON files for configuration (see \autoref{sec:settings}).

\subsubsection{Analysis Work package}\label{sec:analysiswork}

\autoref{img:pack-analyzers} shows the sub-packages of \texttt{analysis.analyzers}.
There are sub-packages for doing the actual analysis work, as well as for the postprocessing and rendering steps.
Additionally, the \texttt{settings} module defines the \texttt{LogSettings} class.

\image{.7\textwidth}{packages-analysis-analyzers}{analysis.analyzers package overview}{img:pack-analyzers}

\paragraph{LogSettings}\label{sec:settings}
This class holds the configuration for an analysis run:
\begin{itemize}
\item The type of the log parser to use
\item Information about the structure of the parsed log files, e.g.
\begin{itemize}
\item What is the key of the field from which the type of a log entry is derived?
\item Which value does this field hold when there is spatial information?
\item Which value indicates game actions?
\item What is the path to obtain spatial information from a spatial entry?
\end{itemize}
\item The analysis setup:
\begin{itemize}
\item Which analyzers to use,
\item and the order in which to apply them
\end{itemize}
\item Variable data to configure the source (see \autoref{sec:source})
\item Rendering methods to apply to the result set
\end{itemize}

The settings are stored as JSON files and parsed at runtime into a \texttt{LogSettings} object (see \autoref{img:oebkml} for a sample JSON settings file).
The helper functions in \texttt{analysis.util} provide a very basic implementation of a query language for Python dictionaries:
a dot-separated string defines the path to take through the dictionary, essentially providing syntactic sugar to avoid lines like \texttt{entry["instance"]["config"]["@id"]}.
As such nested index expressions are difficult to express in a JSON configuration, the path string \texttt{"instance.config.@id"} is much more deserialization friendly.

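To make the idea concrete, the following is a minimal re-implementation of such a dot-path lookup; the function name \texttt{resolve} is an assumption for illustration and does not necessarily match the helper in \texttt{analysis.util}.

\begin{lstlisting}[language=python,caption={Sketch of a dot-path lookup for nested dictionaries (illustrative)},label=code:dotpath-sketch]
def resolve(entry, path, separator="."):
    """Follow a dot-separated key path through nested dictionaries.

    resolve(entry, "instance.config.@id") is shorthand for
    entry["instance"]["config"]["@id"].
    """
    value = entry
    for key in path.split(separator):
        value = value[key]
    return value


# The path string is trivial to store in a JSON settings file.
entry = {"instance": {"config": {"@id": 42}}}
assert resolve(entry, "instance.config.@id") == 42
\end{lstlisting}
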
\paragraph{The Analyzer package} defines the work classes to extract information from log entries.
The package's init module defines the \texttt{Result} and \texttt{ResultStore} classes, as well as the abstract base class for the Analyzers.

As shown in \autoref{code:analyzer}, this base class provides the basic mechanics to access the settings.
The core feature of this project is condensed in the method stub \texttt{process}.
It is fed with a parsed entry from \autoref{sec:loaders}, processes it, possibly updates the internal state of the class, and can then decide to end the processing of the particular log entry or pass it on to the remainder of the analysis chain.

When all log entries of a log file are processed, the \texttt{result} method returns the findings of this analysis instance (see \autoref{par:result}).

\lstinputlisting[language=python,caption={Analyzer base class},label=code:analyzer]{code/analyzer.py}

There are 23 classes implementing analysis functionality, partitioned into modules for generic use, Biodiv2go analysis, and filtering purposes.

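Independent of the actual implementations, the following sketch illustrates what a simple filtering analyzer can look like; it does not inherit from the real base class in \autoref{code:analyzer}, and the assumption that \texttt{process} returns the entry to continue the chain (or \texttt{None} to stop it) is made for illustration only.

\begin{lstlisting}[language=python,caption={Sketch of a filtering analyzer (illustrative)},label=code:filter-sketch]
class SpatialFilter:
    """Only entries marked as spatial pass further down the chain."""

    def __init__(self, type_key="type", spatial_value="location"):
        # Key and value names are assumptions; the real classes take
        # them from the LogSettings object.
        self.type_key = type_key
        self.spatial_value = spatial_value
        self.dropped = 0

    def process(self, entry):
        if entry.get(self.type_key) != self.spatial_value:
            self.dropped += 1
            return None      # end processing of this entry
        return entry         # hand the entry to the next analyzer

    def result(self):
        return {"dropped_entries": self.dropped}
\end{lstlisting}
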
\paragraph{Results}\label{par:result} are stored in a \texttt{Result} object (\texttt{analysis.analyzers.analyzer.\_\_init\_\_}).
This class keeps track of the origin of the resulting data to allow filtering the results by arbitrary analyzer classes.

As \autoref{code:analyzer} shows, the \texttt{Result}s are stored in a \texttt{ResultStore}.
This store, defined next to the \texttt{Result} class, provides means to structure the results by arbitrary measures.
By passing the store's reference into the analyzers, any analyzer can introduce categorization measures.
This allows, for example, distinguishing several log files by name, or combining log files and merging the results by events happening during the games' progress.
With a dictionary of lists as the default, the API supports a callable factory for arbitrary use.

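The following sketch illustrates the idea of such a store; the method names \texttt{add} and \texttt{by\_origin} as well as the attribute layout are assumptions for illustration and do not necessarily match the real \texttt{ResultStore} API.

\begin{lstlisting}[language=python,caption={Sketch of \texttt{Result} and \texttt{ResultStore} (illustrative)},label=code:resultstore-sketch]
from collections import defaultdict


class Result:
    """Couples result data with the analyzer that produced it."""

    def __init__(self, origin, data):
        self.origin = origin     # e.g. the analyzer class or its name
        self.data = data


class ResultStore:
    """Groups results by arbitrary categories, a dict of lists by default."""

    def __init__(self, factory=list):
        # The real store accepts an arbitrary callable factory as well.
        self._categories = defaultdict(factory)

    def add(self, category, result):
        self._categories[category].append(result)

    def by_origin(self, origin):
        """Filter the stored results produced by a given analyzer."""
        return [result
                for results in self._categories.values()
                for result in results
                if result.origin == origin]
\end{lstlisting}
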
\paragraph{Rendering of the Results} is done in the \texttt{render} package.
Similar to the Analyzers' package, the render package defines its common base class in the initialization module, as shown in \autoref{code:render}.
It provides implementers with means to filter the result set down to the relevant analysis types through the \texttt{filter} methods.
The implementation of the rendering method itself is left open.

\lstinputlisting[language=python,caption={Render base class},label=code:render]{code/render.py}

There are 18 implementations, again split into generic and game-specific ones.

The most generic renderers just dump the results into JSON files or echo them to the console.
A more advanced implementation relies on the \texttt{LocationAnalyzer} and creates a KML file with a track animation (example: \autoref{img:oebge}).
Finally, \texttt{biogames.SimulationGroupRender}, for example, performs postprocessing steps on a collection of \texttt{biogames.SimulationOrderAnalyzer} results by creating a graph with NetworkX\furl{https://networkx.github.io/}, rendered with matplotlib\furl{https://matplotlib.org/}, to discover simulation retries (example: \autoref{img:retries}).

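As an illustration of the simplest case, the following sketch dumps a result set to a JSON file; it ignores the \texttt{filter} mechanics of the actual base class in \autoref{code:render}, and the assumption that the results are handed over as a mapping of category names to serializable data is made for illustration only.

\begin{lstlisting}[language=python,caption={Sketch of a minimal JSON dump renderer (illustrative)},label=code:jsonrender-sketch]
import json


class JSONDumpRender:
    """Write every stored result into a single JSON file."""

    def __init__(self, target="results.json"):
        self.target = target

    def render(self, categories):
        # categories: mapping of category name -> list of result data
        with open(self.target, "w") as handle:
            json.dump(categories, handle, indent=2)
\end{lstlisting}
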
\subsection{Sources}\label{sec:source} of log files are clients connecting either directly to game servers or to other log providers.
There is currently a bias towards HTTP clients, as REST APIs are today's go-to default.
To acknowledge this bias, the HTTP-oriented base class is not defined at package level.
The \texttt{Client} originates from the \texttt{clients.webclients} package instead.
It contains some convenience wrappers to add cookies, headers, and URL completion to HTTP calls, as well as to handle file downloads.
The two implementing classes are designed for Biodiv2go and a Geogames-Team log provider.
Using a REST API, the \texttt{Biogames} client integrates seamlessly into the authentication and authorization of the game server.
The client acts as a proxy for users to avoid issues with cross-site scripting (XSS) or cross-origin resource sharing (CORS).

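The following sketch shows the kind of convenience wrapper this describes, assuming the \texttt{requests} library; the method names and the constructor signature are illustrative assumptions rather than the actual \texttt{Client} API.

\begin{lstlisting}[language=python,caption={Sketch of an HTTP client wrapper (illustrative)},label=code:client-sketch]
import requests


class Client:
    """Convenience wrapper around HTTP calls to a game log provider."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()   # keeps cookies between calls

    def get(self, path, **kwargs):
        # URL completion: callers only pass paths relative to the server.
        return self.session.get(
            "{}/{}".format(self.base_url, path.lstrip("/")), **kwargs)

    def download(self, path, target):
        """Stream a (potentially large) log archive to a local file."""
        response = self.get(path, stream=True)
        response.raise_for_status()
        with open(target, "wb") as handle:
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
        return target
\end{lstlisting}
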
The Geogames-Team's geogames like Neocartographer write game logs to files and only have a server running during the active game.
Therefore, an additional log providing server was created to allow access to the log files (see also: \autoref{sec:ggt-server}).

Clients can have an arbitrary number of options, as all fields in the JSON settings file are passed through (see \autoref{img:oebkml}, section ``source'').

\subsection{Web Interface}\label{sec:web}

The selector package holds a Flask\furl{http://flask.pocoo.org/} app providing a web interface for non-expert users.
It utilizes the provided clients (see \autoref{sec:source}) for authentication, and gives users the following options:
\begin{itemize}
\item Exploring available game logs
\item Configuring a new analysis run
\item Viewing the status of analysis runs
\item Viewing the results of analysis runs
\end{itemize}

The web interface offers all available clients for the user to choose from.
With user-provided credentials, the server retrieves the available game logs and offers them, together with the predefined analysis options, to create a new analysis run.
When an analysis run is requested, the server issues a new task to be executed (see \autoref{sec:tasks}).

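A minimal sketch of how such a view can hand work over to the task queue is shown below; the route names, the payload format, and the \texttt{analyze} signature are assumptions for illustration, not the actual selector routes.

\begin{lstlisting}[language=python,caption={Sketch of a Flask view issuing an analysis task (illustrative)},label=code:flask-sketch]
from flask import Flask, jsonify, request

from tasks import analyze   # celery task, see the tasks package

app = Flask(__name__)


@app.route("/runs", methods=["POST"])
def create_run():
    """Schedule a new analysis run and hand the task id to the client."""
    settings = request.get_json()       # configuration chosen by the user
    task = analyze.delay(settings)      # enqueue the task via the broker
    return jsonify({"task_id": task.id}), 202


@app.route("/runs/<task_id>")
def run_status(task_id):
    """Report task progress for the result overview page."""
    task = analyze.AsyncResult(task_id)
    return jsonify({"state": task.state})
\end{lstlisting}
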
An overview page lists the status of the given user's tasks, and provides access to the results once a task is finished.
When problems occur, the status page informs the user, too.

As Flask does not recommend serving static files through itself, an Nginx HTTP server\furl{https://www.nginx.com/} is configured to serve the result files.

\subsubsection{User workflow}

The index page of the web UI features a login form.
It offers a selection of the different configured game backends (see \autoref{img:webindex}).

While a failed login stays at the index, a successful attempt redirects the user to the result overview (see \autoref{img:webresults}).
Here, both the results of completed analysis runs and the status of scheduled and running jobs are visible.
For finished runs, there are links to the result artifacts.

The link \emph{create new analysis} leads to the configuration menu for new analysis runs (see \autoref{img:webcreate}).
It lists the game logs available for the logged-in user, and offers a selection of the predefined analysis configurations.
With a given name, it is easy to identify the results of each analysis run on the result overview page.

\subsection{Task definition}\label{sec:tasks} in the \texttt{tasks} package provides the tasks available for execution.
This package is the interface for Celery\furl{http://www.celeryproject.org/} workers and issuers.
The key point is the task \texttt{analyze} to start new analysis runs.
When a new task is scheduled, the issuer puts a task into the Redis DB\furl{https://redis.io/}.
A free worker node claims the task and executes it.
During runtime, status updates are stored in the Redis DB to inform the issuer about progress, failures, and result artifacts.

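A minimal sketch of such a task definition is given below; the Redis host name \texttt{redis}, the \texttt{analyze} signature, and the reported metadata are assumptions for illustration, not the actual task code.

\begin{lstlisting}[language=python,caption={Sketch of the \texttt{analyze} Celery task (illustrative)},label=code:celery-sketch]
from celery import Celery

# Broker and result backend both point to the Redis service
# reachable inside the docker network.
app = Celery("tasks",
             broker="redis://redis:6379/0",
             backend="redis://redis:6379/0")


@app.task(bind=True)
def analyze(self, settings):
    """Run a full analysis and report progress back through Redis."""
    self.update_state(state="PROGRESS", meta={"step": "loading logs"})
    # ... load the logs, run the analyzer chain, render the results ...
    self.update_state(state="PROGRESS", meta={"step": "rendering"})
    return {"artifacts": ["results.json"]}   # stored in the result backend
\end{lstlisting}
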
\section{Services \& Service composition}

Following the implementation described above, these services are necessary:
\begin{itemize}
\item Analysis framework: Celery
\item User interface: Flask
\item Result server: Nginx
\item Connection between Flask and Celery: Redis
\item Public frontend: Traefik (external)
\end{itemize}

Two additional services were used: one for a local BioDiv2Go server, and one as a log provider for the Neocartographer logs.

The services are managed using Docker\furl{https://www.docker.com/}.
This provides a clear ground for development as well as an easily integrable solution.
Although Docker as a technology may be a current hype, the human-readable build scripts provide documentation about dependencies and installation steps where necessary.

\subsection{Background worker: Celery}\label{sec:srv-celery}

The Celery worker process provides the tasks defined in \autoref{sec:tasks}.
Therefore, it requires all the analysis tools, access to the game log data, and access to a location to store results.
Additionally, a connection to the Redis DB for the job queue is required.
Access to Redis and to the game log providers is granted via a Docker network; storage is mounted as a writable Docker volume.

\subsection{User interface: Flask}

The user interface needs to be available to the public, and needs to be attached to the Redis DB to append analysis jobs to the job queue.
In order to use the Celery API, it too has to include the whole analysis project.

Therefore, it is appropriate to use a single Docker image for both the Celery and the Flask container.
Although it would be possible to use separate images without much overhead in disk space\footnote{
Docker saves each step defined in the Dockerfile as a layer.
Using such a layer as the basis for another image allows shipping additions with only the difference layer.
Unfortunately, each additional layer consumes more space, and optimizations like the removal of build-time requirements may lead to increased runtime overhead when building the images.},
reusing a single image with fewer dependencies helps to keep development on track.
The image itself is rather straightforward.
With an Alpine Linux\furl{https://alpinelinux.org/} image as basis, build-time and runtime dependencies are installed with Alpine's package management system.
Then the Python libraries are installed using pip, and the build-time requirements are cleared.
To reduce the size of the image, these steps are combined into a single layer once they are working.

Using docker labels, the container is flagged to be exposed via Traefik (see \autoref{sec:srv-traefik}).

\subsection{Result server: Nginx}

To serve the static result files, a simple HTTP server is required.
With its low footprint on memory, storage, and CPU, Nginx is a suitable solution.

Equipped with a data volume, this container is again marked with labels to be exposed.

\subsection{Database: Redis}

Redis is one of the recommended backend storages for Celery.
It was chosen due to its simple integration into this environment.

Running inside the Docker network, the only required configuration is a volume for persisting the data across service and system restarts.

\subsection{Geogame Log file provider}\label{sec:ggt-server}

This service provides an HTTP interface for geogames without a permanent game server; it does not need to be public.
Since an HTTP server running Nginx is already integrated, it is an obvious choice to reuse this image, too.

This service, however, does need a little configuration:
to avoid parsing HTML index pages or generating metadata indices, the autoindex feature of Nginx is used.
With the format option\furl{http://nginx.org/en/docs/http/ngx_http_autoindex_module.html\#autoindex_format}, this delivers JSON data instead of HTML, leading to a much more pleasant client implementation.

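As an illustration, a client can list the available log files directly from that JSON index; the URL is hypothetical, and the field names \texttt{name} and \texttt{type} follow the documented JSON output of the autoindex module.

\begin{lstlisting}[language=python,caption={Sketch of listing log files via the JSON autoindex (illustrative)},label=code:autoindex-sketch]
import requests

# Directory listing of the log provider, delivered as JSON by Nginx.
index = requests.get("http://logprovider/logs/").json()

# Keep the plain files; directories are skipped.
log_files = [item["name"] for item in index if item["type"] == "file"]
print(log_files)
\end{lstlisting}
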
\subsection{BioDiv2Go Server}

To integrate nicely with the project and the development machines used during this thesis, the BioDiv2Go server was packaged into Docker containers, too (see \autoref{app:biogames}).

\subsection{Frontend \& Reverse Proxy: Traefik}\label{sec:srv-traefik}

Traefik\furl{https://traefik.io/} is a reverse proxy.
It offers integration with service orchestration systems like Docker, Swarm, and Kubernetes.
With a few lines of configuration, it detects new services automatically, and can create appropriate SSL/TLS certificates on the fly via Let's Encrypt.

Here, it is configured to watch Docker containers and create forwarding rules for those marked with Docker labels.
For fine-grained control, the creation of default forwards is disabled, so only explicitly marked containers are subject to this automatic proxy.
The label \texttt{traefik.enable=true} enables Traefik's reverse proxy pipeline for a container, while \texttt{traefik.port=8080} documents the port where the container exposes its service.

The proxy rule to forward traffic to this container is configured with \texttt{traefik.frontend.rule=Host:select.ma.potato.kinf.wiai.uni-bamberg.de}.
Here, Traefik supports a wide range of options\furl{https://docs.traefik.io/basics/\#frontends}, including combining multiple rules with any-or-all semantics.

For the purposes of this project, a wildcard domain record was used for the development machine, so each service can be made accessible with its own subdomain.

See \autoref{app:traefik} for an example configuration.

\subsection{Service composition and management}

\autoref{img:arch} shows the integration of the services described above into one solution.
This structure is fixed in a Docker Compose\furl{https://docs.docker.com/compose/} setup (see \autoref{code:gglap}).

The advantage of Docker Compose is the definition of all images, volumes, and networks in a single file.
When a scenario with high load occurs, this definition allows for simple scaling:
to create more Celery worker nodes, issuing the command \textit{docker-compose scale worker=8} suffices to run eight worker containers in parallel.

\image{\textwidth}{architecture.pdf}{Service composition overview}{img:arch}