Open thesis topics
Bachelor
Earth Observation (EO) data cubes are multidimensional arrays of spatial and temporal data, crucial for monitoring and analyzing environmental changes. Machine learning (ML) models applied to EO data cubes enable advanced data analysis and predictive capabilities. However, the diversity of programming languages used in the spatial data science and geoinformatics community, particularly R and Python, poses challenges for interoperability and reproducibility of these ML models.
The outcomes of this research are expected to facilitate smoother integration and collaboration among spatial data scientists and geoinformatics professionals who rely on different programming environments, promoting the reproducibility and interoperability of EO data analysis projects. This work will contribute to the broader goal of advancing geospatial data science by bridging the gap between diverse computational ecosystems.
Use Case:
Carrying out spatio-temporal analysis, such as time-series crop classification in Germany, leveraging the ONNX interoperability format: https://onnx.ai/ (a minimal export sketch follows the research questions below)
- Model Portability: How can a deterministic machine learning model, such as Support Vector Machine (SVM), be trained on Earth Observation data cubes in Python and then ported to R using the ONNX format?
- Performance Evaluation: What are the differences in performance and accuracy of the SVM model when ported from Python to R for time-series crop classification in Germany?
- Interoperability Challenges: What are the challenges and potential solutions in ensuring interoperability and reproducibility of machine learning models between Python and R programming environments using ONNX?
- What is the feasibility of implementing identical deep learning models for Earth Observation data cubes in R, Python, and Julia, and ensuring their interoperability?
- How do the available tools and libraries for machine learning in R, Python, and Julia compare in terms of ease of use, performance, and integration with EO data cubes?
- What are the differences in command structure and interface among R, Python, and Julia for machine learning tasks related to EO data cubes, and how do these differences impact the reproducibility and interoperability of the models?
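A minimal sketch of the Python-side export step for the model-portability question above, assuming a synthetic feature table (one row per pixel, one column per band/time step) and the scikit-learn/skl2onnx packages; the class count, feature layout and file name are placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from skl2onnx import to_onnx

# Hypothetical training data: 1000 pixels, 24 time steps of a vegetation index,
# 5 made-up crop classes.
X = np.random.rand(1000, 24).astype(np.float32)
y = np.random.randint(0, 5, 1000)

clf = SVC(kernel="rbf", probability=True).fit(X, y)

# Export to ONNX; the sample row X[:1] lets skl2onnx infer the input signature.
onx = to_onnx(clf, X[:1])
with open("crop_svm.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```

On the R side, the exported file could then be loaded with whatever ONNX runtime is available there (for example via reticulate); which route works best and how faithfully predictions are reproduced is part of the research question.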
Please contact:
Brian Pondi brian.pondi@uni-muenster.de
and
Edzer Pebesma edzer.pebesma@uni-muenster.de
Contact: Brian Pondi
Master
Earth Observation (EO) data cubes are multidimensional arrays of spatial and temporal data, crucial for monitoring and analyzing environmental changes. Machine learning (ML) models applied to EO data cubes enable advanced data analysis and predictive capabilities. However, the diversity of programming languages used in the spatial data science and geoinformatics community, particularly R and Python, poses challenges for interoperability and reproducibility of these ML models.
The outcomes of this research are expected to facilitate smoother integration and collaboration among spatial data scientists and geoinformatics professionals who rely on different programming environments, promoting the reproducibility and interoperability of EO data analysis projects. This work will contribute to the broader goal of advancing geospatial data science by bridging the gap between diverse computational ecosystems.
Use Case:
Carrying out spatio-temporal analysis, such as time-series crop classification in Germany, leveraging the ONNX interoperability format: https://onnx.ai/
- Model Portability: How can a deterministic machine learning model, such as Support Vector Machine (SVM), be trained on Earth Observation data cubes in Python and then ported to R using the ONNX format?
- Performance Evaluation: What are the differences in performance and accuracy of the SVM model when ported from Python to R for time-series crop classification in Germany?
- Interoperability Challenges: What are the challenges and potential solutions in ensuring interoperability and reproducibility of machine learning models between Python and R programming environments using ONNX?
- What is the feasibility of implementing identical deep learning models for Earth Observation data cubes in R, Python, and Julia, and ensuring their interoperability?
- How do the available tools and libraries for machine learning in R, Python, and Julia compare in terms of ease of use, performance, and integration with EO data cubes?
- What are the differences in command structure and interface among R, Python, and Julia for machine learning tasks related to EO data cubes, and how do these differences impact the reproducibility and interoperability of the models?
Please contact:
Brian Pondi brian.pondi@uni-muenster.de
and
Edzer Pebesma edzer.pebesma@uni-muenster.de
Contact: Brian Pondi
- How do TensorFlow and PyTorch compare in terms of model fitting correspondences and the structure of intermediate layers when applied to EO data cubes?
- What are the differences in the interface and tooling availability between TensorFlow and PyTorch for spatio-temporal modeling, e.g. using ConvLSTM?
- How does the performance and interoperability of spatio-temporal deep learning models vary across different versions of TensorFlow and PyTorch when applied to EO data cubes?
NB: The ONNX format can be used to port DL models between TensorFlow and PyTorch; https://onnx.ai/
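As an illustration of the NB above, a hedged sketch of exporting a toy PyTorch model to ONNX; the architecture, tensor shapes and file name are placeholders (a simple Conv3d stack standing in for a spatio-temporal model, not a proposed ConvLSTM):

```python
import torch
import torch.nn as nn

# Placeholder model over (batch, band, time, y, x) input, for export only.
model = nn.Sequential(
    nn.Conv3d(4, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
)
model.eval()

dummy = torch.randn(1, 4, 12, 64, 64)  # 4 bands, 12 time steps, 64x64 pixels
torch.onnx.export(model, dummy, "eo_model.onnx", opset_version=17,
                  input_names=["cube"], output_names=["class_scores"])
# The .onnx file can then be loaded on the TensorFlow side (e.g. via onnx-tf)
# or executed with onnxruntime for a framework-neutral comparison.
```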
Please contact:
Brian Pondi brian.pondi@uni-muenster.de
and
Edzer Pebesma edzer.pebesma@uni-muenster.de
Assigned thesis topics
Bachelor
With climate change, more frequent and more intense forest fires pose a serious threat to the environment and to societies. In a collaboration between the Institute of Landscape Ecology and the Institute of Geoinformatics, we aim to extend a web-based burn simulator (known as Ember-sim) to simulate the spread of fire across the landscape based on established fire behaviour models. The burn simulator will then be used for educational purposes, research, and training professional fire practitioners, governmental officers and volunteers from the community.
The project can be delivered at the BSc or MSc level, and the candidate will be in charge of extending Ember-sim, originally developed for the Australian continent, to Germany. The starting date is open until the position is filled.
Are you experienced with JavaScript and interested in climate change-related topics?
Please contact:
Prof. Mana Gharun mana.gharun@uni-muenster.de
Or
Dr Christian Knoth christian.knoth@uni-muenster.de
Author: Ahmed Aly
Supervisor: Mana Gharun, Christian Knoth
Master
With climate change, more frequent and more intense forest fires pose a serious threat to the environment and to societies. In a collaboration between the Institute of Landscape Ecology and the Institute of Geoinformatics, we aim to extend a web-based burn simulator (known as Ember-sim) to simulate the spread of fire across the landscape based on established fire behaviour models. The burn simulator will then be used for educational purposes, research, and training professional fire practitioners, governmental officers and volunteers from the community.
The project can be delivered at the BSc or MSc level, and the candidate will be in charge of extending Ember-sim, originally developed for the Australian continent, to Germany. The starting date is open until the position is filled.
Are you experienced with JavaScript and interested in climate change-related topics?
Please contact:
Prof. Mana Gharun mana.gharun@uni-muenster.de
Or
Dr Christian Knoth christian.knoth@uni-muenster.de
Author: Ahmed Aly
Supervisor: Mana Gharun, Christian Knoth
Data cubes are an efficient representation for spatiotemporal data such as that from Earth observation satellites. Through multidimensional chunking, they allow highly parallel execution of complex analyses such as time-series change detection. The aim of the thesis is to create an architecture that allows for scaling such operations in a distributed computing environment using containerized processing (Docker) and tools for container orchestration such as Kubernetes. A prototypical implementation as an extension to the gdalcubes library shall be developed and used for a detailed analysis of the scalability of the proposed architecture.
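gdalcubes itself is a C++/R library; purely as a language-neutral illustration of the chunking idea, the following Python sketch (xarray with dask, which are not part of the thesis setup) shows how a spatially chunked cube lets a per-pixel time-series operation run independently per chunk; dimensions and the reduction are invented:

```python
import numpy as np
import xarray as xr  # requires dask for the .chunk() call

# Hypothetical cube (time, y, x), chunked in space so each chunk can be
# processed by an independent worker.
cube = xr.DataArray(
    np.random.rand(24, 512, 512),
    dims=("time", "y", "x"),
    name="ndvi",
).chunk({"time": -1, "y": 128, "x": 128})

# A simple per-pixel change indicator: mean of the second year minus the first.
change = (cube.isel(time=slice(12, None)).mean("time")
          - cube.isel(time=slice(0, 12)).mean("time"))
result = change.compute()  # on a distributed cluster, chunks scale out to workers
```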
Author: Maria Hidalgo
Supervisor: Edzer Pebesma
Completed thesis topics
Bachelor
Climate change-driven shifts in streamflow timing have been documented for Western North America and are expected to continue with increased warming. These changes will likely have the greatest implications for already short and overcommitted water supplies in the region. This study investigated changes in Western North American streamflow timing over the 1948–2008 period, including the very recent warm decade not previously considered, through a) trends in streamflow timing measures, b) two second-order linear models applied simultaneously over the region to test for the acceleration of these changes, and c) changes in runoff regimes. Basins were categorized by the percentage of snowmelt-derived runoff to enable the comparison of groups of streams with similar runoff characteristics and to quantify shifts in snowmelt-dominated regimes.
Results indicate that streamflow has continued to shift to earlier in the water year, most notably for those basins with the largest snowmelt runoff component. However, an acceleration of these streamflow timing changes for the recent warm decades is not clearly indicated. Most coastal rain-dominated and some interior basins have experienced later timing. The timing changes are connected to area-wide warmer temperatures, especially in March and January, and precipitation shifts that bear sub-regional signatures. Notably, a set of the most vulnerable basins has experienced runoff regime changes, such that basins that were snowmelt dominated at the beginning of the observational period shifted to mostly rain dominated in later years. The regions most vulnerable to regime shifts are in the California Sierra Nevada, eastern Washington, Idaho, and north-eastern New Mexico. Snowmelt regime changes may indicate that the time available for adaptation of water supply systems to climatic changes in vulnerable regions is shorter than previously recognized.
Author: Holger Fritze
Supervisor: Edzer Pebesma
Map matching is the matching of GPS trajectories to existing road segments. Several algorithms have been proposed and evaluated for this. This thesis will create and document an open source implementation of one of the more successful of these algorithms.
At the Bachelor level, a simple implementation in a new or existing R package is required. At the Master level, a comparison of different modelling approaches and implementation forms is additionally required.
Requirement: experience with R and R programming.
Resulting package:
http://cran.r-project.org/web/packages/fuzzyMM/index.html
Author: Nikolai Gorte
Supervisor: Edzer Pebesma
The Global Positioning System (GPS) is widely used and is a major positioning technology for land vehicle navigation. However, it is not 100% accurate, which is a problem for any kind of navigation system. There are several factors that contribute to positioning errors, e.g. satellite-related errors, propagation-related errors, receiver-related errors, GPS signal masking or blockage, and the satellite geometric contribution to position error (Quddus, 2006). This is where map matching comes in.
Map matching is the process of matching GPS trajectories to a digital road network. This is done by map matching algorithms. In Quddus (2006) several of the existing map matching algorithms are discussed and three improved map matching algorithms are introduced. One of the more successful algorithms is the fuzzy logic map matching algorithm, whose implementation is the main part of the bachelor thesis.
The above-mentioned map matching algorithm is implemented in R (R Core Team, 2013) and is therefore open source. The testing of the algorithm is done using field data acquired from the enviroCar project.
The result of this bachelor thesis is a documentation and an R package providing functions which allow the user to match their GPS trajectories to a digital road network using the fuzzy logic map matching algorithm and also allow the customization of the parameters used in the membership functions in the fuzzification process.
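For illustration only, a deliberately simplified geometric matcher in Python (shapely): it snaps each GPS fix to the nearest road segment, which is the purely geometric step that the fuzzy-logic algorithm implemented in fuzzyMM refines with heading, speed and connectivity rules; the coordinates and road IDs are made up:

```python
from shapely.geometry import Point, LineString

# Toy road network and GPS track.
roads = {
    "A": LineString([(0, 0), (0, 10)]),
    "B": LineString([(0, 10), (10, 10)]),
}
track = [Point(0.4, 3.0), Point(0.2, 9.5), Point(4.0, 10.3)]

for fix in track:
    # Pick the road segment closest to the fix and snap the fix onto it.
    road_id, road = min(roads.items(), key=lambda kv: kv[1].distance(fix))
    snapped = road.interpolate(road.project(fix))
    print(road_id, round(snapped.x, 2), round(snapped.y, 2))
```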
Author: Nikolai Gorte
Supervisor: Prof. Dr. Edzer Pebesma
Download thesis PDF
43 million passenger cars take part in road traffic in Germany, 95% of which rely on fossil fuels [BMU 2010][KBA 2013]. The heavy traffic volume, especially in cities, has negative consequences for the environment and for the health of the people living there. In 2004, road traffic in Germany alone accounted for 20% of the total volume of direct CO2 emissions [BMU 2008]. Compared to the year 2000, the number of motor vehicles in Germany has increased by 13.2%, while the road infrastructure outside built-up areas has seen only small gains for district roads and motorways and even losses for federal roads [DeSTATIS 2012a][DeSTATIS 2012b]. The losses for federal roads and gains for district roads are mostly due to regional changes in the road infrastructure; federal roads can, for example, be reclassified as district roads [DBT 2013]. Within built-up areas, growth of the road infrastructure is largely limited to the development of new residential and commercial areas [ADAC 2008]. An additional burden on the road network is the increasing mileage of passenger cars, estimated at 610 billion kilometres in 2011 [DIW 2012]. This puts particular strain on urban roads, which cannot be expanded further because of the surrounding buildings. Measures that change traffic behaviour, for example low-emission zones or speed limits, are therefore gaining importance as a way to optimize traffic flow.
In order to produce forecasts and transport plans, reliable information on road traffic in Germany is needed. For this purpose, the Federal Ministry of Transport, Building and Urban Development commissions the so-called German Mobility Panel. Since 1994, roughly 2,000 people have been surveyed annually, each accompanied over three years, about their travel behaviour; among other things, information on trip lengths as well as costs and fuel consumption is collected [DIW 2012]. This survey allows not only the mileage in kilometres to be estimated, but also the average and total fuel consumption and the pollutant emissions of the vehicles. Total fuel consumption is also estimated from the volume of fuel sold at filling stations as well as from consumption figures published by vehicle manufacturers and automobile magazines [DIW 2004].
Energy and fuel consumption in particular is important for calculating CO2 emissions and provides a further source of information on everyday mobility for urban and transport planning. The long-term goals of environmental policy are therefore to reduce CO2 and pollutant emissions and to lower resource consumption in passenger road transport. In many urban areas, monitoring systems for air quality or traffic volume are therefore used to regularly collect important environmental information that serves as a basis for environmental policy decisions. This is where the enviroCar project comes in.
Author: Julius Wittkopp
Supervisor: Prof. Dr. Edzer Pebesma
Download thesis PDF
This bachelor thesis deals with an approach for the estimation of population in small areas. The approach was developed by Klaus Steinnocher (Steinnocher et al., 2006, Steinnocher et al., 2011) and relies on the assumption that the population density is proportional to the degree of soil sealing. Population data on the community level is disaggregated to the level of electoral districts for the whole region of Münster. The "EEA Fast Track Service Precursor on Land Monitoring" dataset represents the degree of soil sealing in 20x20 m grid cells and is used as the data basis for this approach. The CORINE Land Cover (CLC) dataset is used to mask those areas from the EEA dataset that are not used for residential purposes. For the same reason, an OpenStreetMap dataset is used to mask the streets from the EEA dataset. The approach from Steinnocher is applied to the remaining areas, which represent the living space as well as possible. The calculated population data is compared to the reference population data at the level of electoral districts. Differences and tendencies are discussed. Furthermore, the implemented approach from Steinnocher (Steinnocher et al., 2006, Steinnocher et al., 2011) is compared to another approach developed by Francisco Javier Gallego (Gallego et al., 2001, Gallego et al., 2010). Differences and characteristics of the two approaches are discussed.
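A toy numerical sketch of the proportional (dasymetric) disaggregation idea described above, with invented sealing values, residential mask and population count:

```python
import numpy as np

# Distribute a community's population over its grid cells in proportion to the
# degree of soil sealing, after masking non-residential cells (CLC/OSM mask).
sealing = np.array([0.8, 0.6, 0.0, 0.3])            # degree of sealing per cell
residential = np.array([True, True, False, True])   # non-residential cells masked
community_population = 1700

weights = np.where(residential, sealing, 0.0)
cell_population = community_population * weights / weights.sum()
print(cell_population)  # [800. 600.   0. 300.]
```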
Author: Lars Syfuß
Supervisor: Prof. Dr. Edzer Pebesma
Download thesis PDF
Snowmelt-derived water is of great importance for most streams across Western North America. A changing climate affects streamflow and changes its intra-annual contribution. Shifts of the center of timing (CT) and the start of the snowmelt pulse towards earlier in the year have already been detected for the 1948-2000 period. While the trends for CT have increased for the 1948-2008 period, the ones for the snowmelt pulse did not appear to have accelerated. In contrast, the number of snowmelt pulses decreased within the same period, indicating that more winter precipitation came as rain rather than snow. Based on the ratio of years with and without snowmelt pulses, this study has developed a measure of snowmelt domination and classified the streams into four categories. These categories were used to compare groups of streams with similar runoff characteristics and to quantify shifts in snowmelt domination regimes. Furthermore, the data and measures were interactively visualized on a virtual globe using NASA World Wind Java.
Author: Holger Fritze
Supervisor: Prof. Dr. Edzer Pebesma
Download thesis PDF
- Monitoring Violent Conflicts: Web-mapping platform to combine automatic and manual image analysis STML
The high number of violent conflicts worldwide and the extent to which human rights are abused during acts of war stress the need for close monitoring and documentation of conflict areas to strengthen public international law. As a comprehensive ground‐level documentation of combat impacts is often hardly possible in conflict areas, satellite imagery and geospatial technology are increasingly being used to document and communicate human rights issues. Satellite images can for example provide visual access to remote or insecure areas as well as visual evidence to corroborate on-the-ground reports on human rights violations. Most of the practical applications rely on the manual image interpretation and identification of objects of interest. However, the time consumption of such analyses is substantial. One strategy to cope with the immense workload is to make use of a decentralized approach and distribute the work among several analysts e.g. within crowd-sourcing networks (by use of micro tasking tools, see e.g. http://www.tomnod.com/). Here the images are divided into subsets and individually investigated by volunteers. Another strategy is to use computer assisted methods for (semi-) automatic information extraction to reduce the analysis workload. Current approaches focus on either web-mapping for collaborative monitoring of violence or on image analysis and classification methods for automatically detecting structural damage in conflict areas.
The aim of this thesis is to develop a prototypical web-mapping and micro-tasking tool for collaborative conflict monitoring which combines both above-mentioned fields. It should integrate existing results from automatic classification methods by using them as a basis for the automatic creation and prioritization of areas of interest. Areas with a high probability/density of destruction receive a higher priority for the subsequent manual analysis by volunteers. The input data can be in different levels of detail, e.g. polygons of areas with different probabilities of destruction or even point data indicating the location of destroyed buildings. The web application should include a method to create and prioritize image subsets based on this input data and contain tools for a meaningful visualization of bi-temporal image data (pre- and post-conflict image) as well as for manually tagging destroyed buildings. Optionally, further methods and interfaces combining automatic and manual image analysis can be developed.
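A hedged Python/shapely sketch of the prioritization idea: tiles of the image extent are ranked by how much high-probability destruction (from a hypothetical automatic pre-analysis) they overlap; grid size, polygons and probabilities are invented:

```python
from shapely.geometry import box

# 4x4 grid of 100x100 image tiles over a toy extent.
tiles = {(i, j): box(i * 100, j * 100, (i + 1) * 100, (j + 1) * 100)
         for i in range(4) for j in range(4)}

# (polygon, destruction probability) pairs from the automatic analysis.
detections = [
    (box(40, 40, 160, 120), 0.9),
    (box(250, 300, 330, 380), 0.4),
]

def priority(tile):
    # Probability-weighted overlap area of all detections with this tile.
    return sum(p * tile.intersection(poly).area for poly, p in detections)

ranked = sorted(tiles, key=lambda k: priority(tiles[k]), reverse=True)
print(ranked[:3])  # tiles volunteers should inspect first
```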
Author: Sofian Slimani
Supervisor: Christian Knoth
Supervisor: Marius Appel
- openEO Hub STML
openEO develops an open API to connect R, Python and JavaScript clients to big Earth observation cloud back-ends in a simple and unified way. Back-ends process user-defined algorithms on remote sensing data sets within their cloud infrastructure. Although the communication between clients and back-ends is standardized by the openEO API, each back-end will implement the API to a different extent and will differ with regard to available processes and data sets. Therefore, users should be able to search on a central platform for back-ends that fully support their requirements. This includes the ability to search for back-ends by
- data sets, e.g. temporal extent, spatial extent, platform, sensor, bands or name,
- processes, e.g. by a process graph provided by the user,
- other back-end related metadata, e.g. API version, capabilities or costs.
Additionally, it could be useful for users to publish and share their algorithms as process graphs or user-defined functions (UDFs) on this central platform.
This thesis should explore, implement and evaluate one or multiple of these aspects. The scope of the thesis is designed to fit the requirements of a bachelor thesis. More information can be found in the openEO Hub GitHub repository.
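A small sketch of the metadata such a hub could harvest, assuming a back-end that implements the openEO API endpoints /collections and /processes; the back-end URL is a placeholder:

```python
import requests

backend = "https://example-openeo-backend.org/openeo/1.2"  # placeholder URL

capabilities = requests.get(f"{backend}/").json()            # API version, endpoints
collections = requests.get(f"{backend}/collections").json()  # data sets and extents
processes = requests.get(f"{backend}/processes").json()      # available processes

# Minimal index entry a hub could store and make searchable.
index_entry = {
    "backend": backend,
    "api_version": capabilities.get("api_version"),
    "collections": [c["id"] for c in collections.get("collections", [])],
    "processes": [p["id"] for p in processes.get("processes", [])],
}
print(index_entry)
```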
Contact
- Edzer Pebesma - edzer.pebesma@uni-muenster.de
- Matthias Mohr - m.mohr@uni-muenster.de
Supervisor: Edzer Pebesma
Master
openEO develops an open API to connect R, Python, and JavaScript clients to big Earth observation cloud back-ends in a simple and unified way. Back-ends process user-defined algorithms on remote sensing data sets within their cloud infrastructure. This thesis will evaluate and implement ways to run openEO user-defined algorithms in a browser environment, e.g. through JavaScript, so that an algorithm can be fully executed on the client side for an AOI selected by a user through a map. The required steps to achieve this are as follows:
- A map is shown in the browser and the user navigates to an AOI
- A user can select and load a cloud-native dataset for the AOI, e.g. stored as cloud-optimized GeoTiffs
- An algorithm can be specified through openEO processes and the processing runs in the browser. A set of openEO processes for a use case has to be implemented by the student.
- Finally, the data is visualized using a mapping/visualization library
This thesis should explore, implement, and evaluate one or multiple of these aspects. The scope of the thesis is designed to fit the requirements of a master thesis, but it can probably be split into multiple bachelor theses, too. More information can be found in the openEO Browser Backend GitHub repository.
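The thesis targets in-browser execution (e.g. with a JavaScript GeoTIFF reader); purely to illustrate the underlying idea of reading only the AOI window from a cloud-optimized GeoTIFF instead of downloading the whole file, here is a hedged Python/rasterio sketch with a placeholder URL and bounding box (assuming the file's CRS matches the coordinates):

```python
import rasterio
from rasterio.windows import from_bounds

url = "https://example.org/cogs/sentinel2_b04.tif"  # placeholder COG URL
with rasterio.open(url) as src:
    # AOI bounding box in the dataset's CRS; only the overlapping blocks are fetched.
    window = from_bounds(7.55, 51.92, 7.70, 52.00, transform=src.transform)
    aoi = src.read(1, window=window)
    print(aoi.shape)
```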
Contact
- Edzer Pebesma - edzer.pebesma@uni-muenster.de
- Matthias Mohr - m.mohr@uni-muenster.de
Supervisor: Edzer Pebesma
The usage of Micro Unmanned Aerial Vehicles (MUAV) as mobile sensor platforms is constantly increasing in the scientific, as well as in the civilian sector. A variety of requirements evolve from upcoming mission tasks like documentation, surveying and inspection in agriculture and geography, as well as in the industry. Many applications, such as the creation of orthoimages or the inspection of industrial plants need accurate position information in real-time, both for safety-in-flight reasons and for enriching sensor data by the provision of location.
As current MUAVs make use of common Global Positioning System receivers and, therefore, do not guarantee reliable high-precision positioning, this work examines the demands on an improved Differential Global Navigation Satellite System (DGNSS) positioning system for its integration into an existing MUAV platform. It proposes a flexible system architecture and presents a modular prototype that offers the possibility to exchange discrete components for making use of more sophisticated technologies like Precise DGNSS. The described prototype already guarantees horizontal positioning accuracy of 35 cm in real-time, which can be considered as sufficient for the majority of applications.
Consequently, this work focuses on the integration of position and additional navigation data into an existing Sensor Platform Framework software, which is able to synchronize sensor and navigation information on-the-fly. It introduces a MUAV platform-specific Input-Plugin for decoding the telemetry data stream and for the communication with the framework. As the framework is able to forward the processed geodata in a standardized way according to the guidelines of the Open Geospatial Consortium Inc., the data can be exploited by any kind of Sensor Web Service in near real-time.
Author: Jakob Geipel
Supervisor: Prof. Dr. Edzer Pebesma
Download thesis PDF
In the field of reproducible research, scientific articles are organized together with data and program code in the form of compendia. The goal of such compendia is to make the data or the analyses exchangeable and to secure long-term access to data and software. For a good user experience, the exchange of data and analyses between compendia should be simple and stable, i.e. the data of one compendium should be compatible with the code of another.
Search databases such as Elasticsearch play a central role in finding documents on the web. A typical feature of such a search is the suggestion of similar documents on the basis of high-performance inverted indexes.
A first step towards such a compatibility analysis are the direct and indirect metadata collected in search databases. Today, these metadata are mostly created by the author (abstract, keywords) and are not comprehensive. On the basis of compendia, these and further pieces of information can be derived from the secondary files (data, source code). For example, similar software components or data excerpts could indicate that two given compendia are compatible enough for the data of one to be combined with the analysis of the other.
The goal of this thesis is to review the possible sources of metadata for scientific publications and to match them with the requirements of this use case. New ways to extend, integrate and compare the metadata records are to be designed and evaluated by means of a prototypical implementation.
The thesis can be written in German or English.
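A hedged sketch of the Elasticsearch side, assuming a local instance and a hypothetical "compendia" index holding metadata derived from a compendium's secondary files; the field names and values are invented:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index derived metadata for one compendium (hypothetical fields).
es.index(index="compendia", id="c1", document={
    "title": "Land cover change in NRW",
    "r_packages": ["sf", "stars", "randomForest"],
    "data_variables": ["ndvi", "landcover_class"],
})

# Ask the inverted index for compendia with similar software/data signatures.
similar = es.search(index="compendia", query={
    "more_like_this": {
        "fields": ["r_packages", "data_variables"],
        "like": [{"_index": "compendia", "_id": "c1"}],
        "min_term_freq": 1,
        "min_doc_freq": 1,
    }
})
print(similar["hits"]["hits"])
```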
Author: Lukas Lohoff
Supervisor: Edzer Pebesma
Computational research introduces challenges when it comes to reproducibility, i.e. re-doing an analysis with the same data and code. A current research project at ifgi developed a new approach called Executable Research Compendia (ERC, see https://doi.org/10.1045/january2017-nuest) to solve some of these challenges. An ERC contains everything needed to run an analysis: data, code, and runtime environment, so it can be executed "offline" in a sandbox environment. An open challenge is that of big datasets and reducing data duplication. While the idea of putting "everything" into the ERC is useful in many cases, once the dataset becomes very large it is not feasible to replicate it completely for the sake of reproducibility/transparency and, to some extent, for archival.
This thesis will create a concept for allowing ERC to communicate with specific data repositories (e.g. PANGAEA, GFZ Data Services) extending on previous work (https://doi.org/10.5281/zenodo.1478542). The new approach should let ERCs “break out” of their sandbox environments in a controlled and transparent fashion, while at the same time more explicitly configuring the allowed actions by a container (e.g. using AppArmor).
Since trust is highly important in research applications, the communication with remote services must be exposed to users in a useful and understandable fashion. Users who evaluate other scientists' ERCs must know which third-party repositories are used and how. The concept must be (i) implemented in a prototype using Docker containerization technology and discussed from the viewpoints of security, scalability, and transparency, and (ii) demonstrated with ERC based on different geoscience data repositories, e.g. Sentinel Hub, and processing infrastructure, e.g. openEO or WPS, including an approach for authentication. Furthermore, it could be evaluated whether the sandbox can be defined more explicitly, and whether the communication between ERC and remote service can be captured and cached as an additional backup, so that future executions may re-use that backup.
Prior experience with Docker is useful but not a strict requirement.
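A minimal sketch of the sandbox idea using the Docker SDK for Python; the image name, network name and repository URL are placeholders, and the actual filtering of allowed endpoints (proxy, AppArmor or firewall rules) is not shown:

```python
import docker

client = docker.from_env()

# Fully offline execution: the current ERC sandbox idea, no network at all.
client.containers.run("erc/example-compendium", network_mode="none", remove=True)

# Controlled "break-out": attach a dedicated network only when the compendium
# declares an allowed data repository; which hosts are reachable would be
# restricted by a proxy or AppArmor/firewall configuration outside this sketch.
client.networks.create("erc-data-access", driver="bridge")
client.containers.run("erc/example-compendium",
                      network="erc-data-access",
                      environment={"DATA_REPOSITORY": "https://www.pangaea.de"},
                      remove=True)
```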
Contact: Daniel Nüst
Supervisor: Daniel Nüst
Download thesis PDF
openEO develops an open API to connect R, Python and JavaScript clients to big Earth observation cloud back-ends in a simple and unified way. Back-ends process user-defined algorithms on remote sensing data sets - usually image-based - within their cloud infrastructure. An important aspect is to facilitate users to switch between back-ends easily while still getting consistent and comparable processing results. Back-ends use different IT infrastructure and software to process data, although they share the same specification for processes and for communication between clients and back-ends: the openEO API. It is still necessary to ensure that processes comply with the specification. As a consequence, the results from back-ends are often not comparable by default and need to be checked for compliance with the specification. One way to ensure compliance is to process certain standardized reference data sets and validate the results. The openEO project still has to select such data sets. Additionally, the differences in infrastructure and software may lead to at least small differences in the processing results, either due to rounding in floating-point arithmetic or implementation details. Therefore, a threshold within which results are allowed to differ needs to be defined. This thesis aims to solve the issues raised by
- defining which aspects an image-based data set needs to fulfil for our validation purposes,
- selecting suitable image data sets for validation purposes,
- defining the concrete rules and a workflow for validation,
- and implementing a prototype for the specified workflow.
The scope of the thesis can be adapted to fit the requirements of either a bachelor thesis or a master thesis. Some more information can be found in the corresponding openEO API GitHub issue.
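A minimal sketch of the tolerance idea from the description above, using numpy; the tolerance values and arrays are placeholders that a validation workflow would have to justify:

```python
import numpy as np

def results_match(result_a, result_b, rtol=1e-5, atol=1e-8):
    # Two back-end results count as equivalent if they agree within a
    # relative/absolute tolerance; NaN (no-data) must match as well.
    return np.allclose(result_a, result_b, rtol=rtol, atol=atol, equal_nan=True)

backend_a = np.array([[0.123456, 0.5], [np.nan, 1.0]])
backend_b = np.array([[0.123457, 0.5], [np.nan, 1.0]])
print(results_match(backend_a, backend_b))  # True: differences below the threshold
```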
Contact
- Edzer Pebesma - edzer.pebesma@uni-muenster.de
- Matthias Mohr - m.mohr@uni-muenster.de
Supervisor: Edzer Pebesma
- Which R-spatial packages can be installed in alternative R implementations?
- What are the main obstacles to a comprehensive geospatial toolset in alternative R implementations?
- What role do system libraries play in the R-spatial ecosystem from the perspective of alternative R implementations?
- How can containers support transparent benchmarking across R versions and implementations?
Author: Ismail Sunni
Supervisor: Edzer Pebesma