Title

Topic Modelling of Anthropological Journals Using Latent Dirichlet Allocation

Description

Topic modelling is a text mining technique used to discover patterns in textual data. Given a large collection of text documents, a topic model extracts the topics that best represent the content. Based on the frequency and topic assignment of the words observed in a corpus, a topic model discovers themes as recurring clusters of words. One of the most widely used topic models is latent Dirichlet allocation (LDA; Blei et al. 2003), which has been applied in domains such as natural language processing and computer vision. LDA is a generative, probabilistic approach to inferring latent topics from a large corpus of text documents. It assumes that every document in the corpus is a mixture of a predefined number of topics and that every topic has its own distribution over the words of a fixed vocabulary. Inference is iterative: topic assignments are updated repeatedly according to how words co-occur across documents, until a hidden topic structure emerges from the corpus. Since LDA was proposed, it has been used in text, image and video applications as a tool for retrieving and analysing information.
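
The iterative inference described above can be illustrated with a minimal sketch. It uses collapsed Gibbs sampling, which is one common way to fit LDA and not necessarily the method the thesis will use. The function name lda_gibbs, the hyperparameter values and the toy documents are all assumptions made for illustration; they are not taken from the Manuscript Cultures data.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Count tables that fully describe the current topic assignment.
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # A topic is likely if it is common in this document
                # and if it often generates this word elsewhere.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Report the most frequent words of each topic.
    top_words = [
        [word for word, count in topic_word[k].most_common(3) if count > 0]
        for k in range(num_topics)
    ]
    return doc_topic, top_words


# Hypothetical toy corpus, already tokenised.
docs = [
    "manuscript scribe ink paper manuscript".split(),
    "scribe ink manuscript paper ink".split(),
    "ritual kinship village ritual field".split(),
    "village kinship field ritual kinship".split(),
]
doc_topic, top_words = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

The number of topics is fixed in advance, as LDA assumes, and each row of doc_topic gives the number of tokens in a document currently assigned to each topic. On a real corpus the documents would first need tokenisation and stop-word removal, and an established library implementation would normally be preferred over a hand-written sampler.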

The goal of this thesis is to evaluate the topic extraction performance of an LDA implementation on a corpus of anthropology-related journals comprising ancient texts and manuscripts. Manuscript cultures is a publication by the research group ‘Manuscript Cultures in Asia and Africa’ (MCAA) of Hamburg University concerned with the study of ancient and modern written artefacts. Analysing these publications offers insight to both general readers and researchers interested in the field. In the long run, LDA analysis can also help to examine research trends in the field of cultural anthropology.

Data set: Manuscript Cultures

Requirements

Knowledge of probabilistic modelling is helpful but not necessary

Person working on it

Nadja Redzuan

Category

Bachelor thesis