Text Indexing
Main Idea
Naively searching a huge text or a collection of documents for a pattern is slow. This is especially true when there are multiple queries for different patterns, because the text or the documents must be scanned again for every query. An index provides additional information about the text or the documents that allows patterns to be found more efficiently. Because the index needs to be created only once, it can be used for all queries as long as the text or the documents do not change. Still, computing the index should be reasonably fast and memory efficient, as indices are created for huge texts and collections of documents.
In this group, we mainly focus on full-text indices, which can answer every query regardless of the pattern. Popular representatives of this type of index are suffix arrays and suffix trees.
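To make the idea concrete, here is a minimal Python sketch of a suffix array used as a full-text index. It is an illustration only, not the group's implementation: the function names `build_suffix_array` and `find_occurrences` are made up for this example, and the construction simply sorts all suffixes, which is far too slow and memory hungry for the massive texts discussed above. The point is only that the index is built once and then answers each query by binary search instead of rescanning the text.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting the suffixes directly; fine for small
    examples, not suitable for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern` in `text`, in increasing order.

    All suffixes that start with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block's boundaries.
    """
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "banana"
index = build_suffix_array(text)           # built once
print(find_occurrences(text, index, "ana"))  # reused for every query
print(find_occurrences(text, index, "na"))
```

Each query needs only a logarithmic number of string comparisons in the text length, whereas a naive scan touches the whole text for every pattern.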
Current Projects
- Massive Text Indices, which is part of the DFG-SPP 1736 Algorithms for Big Data, since 2014.
- Project Group "SciencePlag", 2015-2016 (in German).
- Project Group "SACABench", starting 2018 (in German).