Project Ideas

The following is a collection of our project ideas. If you have further project ideas or want to discuss some of our proposed projects, please contact us via e-mail. Feel free to subscribe to our mailing list and follow the discussions to see what is going on.

1. Review of Coding Strategies

tudocomp currently provides only a canonical Huffman coder and a customized static low entropy coder.

Novel coding algorithms provide better precision than a Huffman coder and are faster than arithmetic coders. It would be interesting to integrate general codings like ANS or FSE, or specialized codings like ETDC, into tudocomp and to evaluate all codings with respect to their compression ratio and coding/decoding time.

Category: Encoding

2. The AreaComp Compression Algorithm

The idea of AreaComp is to substitute frequent large substrings of a text. We search for the substring that maximizes the value of a cost function. The cost function weights the number of occurrences and the length of a substring; a natural choice is the product of the length and the number of occurrences of the substring.

Given the suffix array and the longest common prefix (LCP) array, we can count the occurrences of a substring: all suffixes starting with that substring form a contiguous interval of the suffix array, and the LCP array tells us where this interval begins and ends.

A naive approach would be to store all substrings of a certain length that occur at least twice in a priority queue, each keyed by its cost function value. We pop the topmost element (i.e., the best substring w.r.t. the cost function) off the heap, substitute its occurrences, and update the suffix array and the longest common prefix array. There is a similar method called "greedy off-line textual substitution" that considers all non-overlapping occurrences.
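The search for the best substring can be sketched as follows. This is only a minimal illustration, not tudocomp code: the names `Candidate` and `best_substring` are hypothetical, the suffix array is built by plain sorting (far too slow for real inputs), and occurrences are counted even if they overlap. The cost function is the product of length and number of occurrences.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical result type: a substring given by one starting position,
// its length, and the number of (possibly overlapping) occurrences.
struct Candidate {
    std::size_t position;
    std::size_t length;
    std::size_t occurrences;
};

// Returns a repeated substring maximizing length * occurrences,
// or {0, 0, 0} if no substring occurs twice.
Candidate best_substring(const std::string& text) {
    const std::size_t n = text.size();

    // Naive suffix array: sort all suffix start positions lexicographically.
    std::vector<std::size_t> sa(n);
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });

    // lcp[i] = length of the common prefix of the suffixes sa[i-1] and sa[i].
    std::vector<std::size_t> lcp(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        std::size_t a = sa[i - 1], b = sa[i], l = 0;
        while (a + l < n && b + l < n && text[a + l] == text[b + l]) ++l;
        lcp[i] = l;
    }

    Candidate best{0, 0, 0};
    for (std::size_t i = 1; i < n; ++i) {
        const std::size_t len = lcp[i];
        if (len == 0) continue;
        // Grow the suffix array interval in which all suffixes share
        // the prefix of length len.
        std::size_t lo = i, hi = i;
        while (lo > 1 && lcp[lo - 1] >= len) --lo;
        while (hi + 1 < n && lcp[hi + 1] >= len) ++hi;
        const std::size_t occ = hi - lo + 2;  // suffixes sa[lo-1] .. sa[hi]
        if (len * occ > best.length * best.occurrences) {
            best = Candidate{sa[i], len, occ};
        }
    }
    return best;
}
```

For the text "abababab", this sketch selects "abab" with three overlapping occurrences. A real implementation would additionally have to restrict itself to non-overlapping occurrences before substituting.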

Category: Compression

3. Clever Tie Breaking for lcpcomp

Our compressor lcpcomp (implemented in tudocomp) is a longest-first greedy compression algorithm.

Given multiple longest substrings, there is no strict tie breaking rule that states which of the substrings to select for substitution. The focus of this project is to enhance the lcpcomp compressor with a heuristic that decides which substring lcpcomp should choose. The heuristic can be based on selecting the tie breaking rule with the best expected compression ratio or decompression speed.

Category: Algorithm Engineering

References:

This problem is similar to a semi-greedy variant of LZ77.

4. k-maxsat for lcpcomp

Decompressing an lcpcomp-compressed file is time-consuming. That is because references of lcpcomp can be nested, i.e., a reference can refer to a substring that was itself replaced with another reference.

The nesting of references can form long dependency chains that need to be resolved before the actual decompression can take place. This project focuses on a modification of the compression strategy, where we want to check whether it is possible to circumvent the production of long dependency chains.
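To make the notion of a dependency chain concrete, the following sketch measures the longest chain in a simplified model. The representation is an assumption made for illustration and is not the lcpcomp file format: `source[i]` names the text position from which position `i` is copied, and the hypothetical marker `kLiteral` denotes a position that is stored verbatim. The references are assumed to be acyclic, as in any decodable file.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical marker: the position is stored as a plain character.
constexpr std::size_t kLiteral = static_cast<std::size_t>(-1);

// Returns the length of the longest chain of references that has to be
// followed until a literal character is reached.
// Precondition: every non-literal entry is a valid index, and the
// references contain no cycle.
std::size_t longest_chain(const std::vector<std::size_t>& source) {
    const std::size_t n = source.size();
    std::vector<std::size_t> depth(n, kLiteral);  // kLiteral = not computed yet
    std::vector<std::size_t> path;
    std::size_t longest = 0;

    for (std::size_t start = 0; start < n; ++start) {
        // Follow the references until we hit a literal or a known depth.
        path.clear();
        std::size_t pos = start;
        while (depth[pos] == kLiteral && source[pos] != kLiteral) {
            path.push_back(pos);
            pos = source[pos];
        }
        if (depth[pos] == kLiteral) depth[pos] = 0;  // pos is a literal

        // Walk back and assign increasing depths along the path.
        std::size_t d = depth[pos];
        while (!path.empty()) {
            depth[path.back()] = ++d;
            path.pop_back();
        }
        longest = std::max(longest, depth[start]);
    }
    return longest;
}
```

Because each position's depth is computed only once, the measurement runs in linear time. Such a statistic could serve as the quantity that a modified compression strategy tries to keep small.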

It can be shown that this problem is related to the k-maxsat problem. The aim of this project is to devise an approximation algorithm for the k-maxsat problem to prevent long dependency chains.

Category: Algorithm Engineering

References:

Mayr, Ernst W., Prömel, Hans Jürgen, Steger, Angelika: "Lectures on Proof Verification and Approximation Algorithms", the MaxEkSat-Problem

5. Efficient Integer Coders

The aim of this project is to implement and evaluate a fast Fibonacci encoding algorithm.

Fibonacci coding is a universal code that represents integers succinctly. The coding splits an integer into summands that are Fibonacci numbers. Although the coding achieves a very compact representation, its decoding is slow. In this project, the encoding shall be implemented and benchmarked in tudocomp, i.e., the goal is to measure its speed and compare it to currently available Fibonacci coders.
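As a reference point, a straightforward (and deliberately unoptimized) Fibonacci coder could look like the sketch below. The function names `fibonacci_encode` and `fibonacci_decode` are hypothetical and not part of tudocomp, and the code is returned as a string of '0' and '1' characters for readability rather than as packed bits. The summands are chosen greedily from the largest Fibonacci number downwards, and a final '1' terminates the codeword.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Encodes n >= 1. Position i of the result corresponds to the i-th
// Fibonacci number of the sequence 1, 2, 3, 5, 8, ...
std::string fibonacci_encode(std::uint64_t n) {
    if (n == 0) throw std::invalid_argument("Fibonacci coding needs n >= 1");

    // Collect all Fibonacci numbers that are at most n.
    std::vector<std::uint64_t> fib = {1};
    std::uint64_t prev = 1, cur = 2;
    while (cur <= n) {
        fib.push_back(cur);
        if (prev > n - cur) break;  // next number would exceed n (no overflow)
        const std::uint64_t next = prev + cur;
        prev = cur;
        cur = next;
    }

    // Greedily pick summands, largest first.
    std::string code(fib.size(), '0');
    for (std::size_t i = fib.size(); i-- > 0;) {
        if (fib[i] <= n) {
            code[i] = '1';
            n -= fib[i];
        }
    }
    code.push_back('1');  // terminator: the only place where "11" occurs
    return code;
}

// Decodes a single codeword produced by fibonacci_encode.
std::uint64_t fibonacci_decode(const std::string& code) {
    std::uint64_t value = 0, cur = 1, next = 2;
    // The final '1' is the terminator and carries no value.
    for (std::size_t i = 0; i + 1 < code.size(); ++i) {
        if (code[i] == '1') value += cur;
        const std::uint64_t sum = cur + next;
        cur = next;
        next = sum;
    }
    return value;
}
```

For example, 11 = 3 + 8 is encoded as 001011. A fast coder would typically replace the per-bit loops with table lookups; this sketch only fixes the expected output against which such an implementation can be checked.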

Category: Encodings

6. LZ78 with a Compact Hash Table

Our LZ78 compressor can utilize different Lempel-Ziv-78 tries, e.g., a binary trie, a ternary trie, or a trie based on a hash table. The latter is the fastest, but also the most memory-hungry trie implementation.

In the light of compact hash tables, we wonder whether we can reduce the memory footprint of hash table implementations while still being very fast. The goal of this project is to research this topic and develop a new, memory-efficient LZ78 trie based on a compact hash table.
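For orientation, the sketch below shows how an LZ78 trie can be kept in a hash table. It is a simplified illustration and not the tudocomp implementation: the names `Lz78Factor`, `lz78_factorize`, and `lz78_decode` are hypothetical, and `std::unordered_map` stands in for the hash table that a compact variant would replace. Each trie edge is stored as a key built from the parent node id and the edge character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A factor refers to an earlier factor (0 = empty string) and appends a character.
using Lz78Factor = std::pair<std::uint32_t, char>;

std::vector<Lz78Factor> lz78_factorize(const std::string& text) {
    // Trie edges: key = (parent node id << 8) | character, value = child node id.
    std::unordered_map<std::uint64_t, std::uint32_t> trie;
    std::vector<Lz78Factor> factors;
    std::uint32_t node = 0;     // current trie node, 0 is the root
    std::uint32_t next_id = 1;  // id of the next node to be created

    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const std::uint64_t key = (static_cast<std::uint64_t>(node) << 8) | c;
        const auto it = trie.find(key);
        if (it != trie.end() && i + 1 < text.size()) {
            node = it->second;  // the edge exists: extend the current match
            continue;
        }
        // Emit a factor; at the very end of the text the edge may already exist.
        if (it == trie.end()) trie.emplace(key, next_id++);
        factors.emplace_back(node, text[i]);
        node = 0;
    }
    return factors;
}

// Rebuilds the text by storing every phrase explicitly (simple, not space-efficient).
std::string lz78_decode(const std::vector<Lz78Factor>& factors) {
    std::vector<std::string> phrases = {""};
    std::string text;
    for (const Lz78Factor& f : factors) {
        const std::string phrase = phrases[f.first] + f.second;
        text += phrase;
        phrases.push_back(phrase);
    }
    return text;
}
```

The text "abababab" is split into the five factors a, b, ab, aba, b. Since all trie operations go through `find` and `emplace` on a single key, swapping in a different hash table only touches the declaration of `trie`, which is what makes this layout a convenient starting point for memory experiments.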

References:

"Don't Thrash: How to Cache Your Hash on Flash"

Category: Compression, Hashing

7. Web Application for Visual String Analysis

While working with lossless compression algorithms on texts, we often miss tools that visualize text index data structures.

There are some tools that provide limited insight into some data structures. However, there is no powerful and easy-to-use tool that covers the majority of the most frequently used data structures.

The aim of this project is to produce a web application (preferably based on JavaScript) that interactively visualizes the most commonly used data structures like suffix arrays, longest common prefix arrays, etc., with respect to text compression.

Category: Visualization, Web Development

8. 7zip-compatible Output Format

To improve the usability of the tudocomp framework, we want the tudocomp output format to become compatible with 7zip. The 7z format supports various compression techniques thanks to a versatile header that describes exactly which compression technique was used.

This project's aim is to adapt the 7z format for the tudocomp command line tool.

Category: Interoperability

References:

7zip

9. Graphical User Interface

The tudocomp framework provides only a command line tool as an interface to the end user. A graphical user interface would benefit the project by reaching users who prefer not to work with command line tools.

The GUI should provide the selection of multiple files/directories and an easy way to assemble a custom compression pipeline using what is available in tudocomp. It should be as platform independent as possible and usable on any platform supported by tudocomp.

The GUI can be developed, for instance, using a framework like Qt.

Category: GUI Development

10. Variants of LZ78U

Our LZ78U compressor currently uses the class cst_sada of the SDSL-lite library to build a suffix tree. Alternative suffix tree construction data structures could be faster or more memory-friendly.

Another interesting problem is how the factorization of the factor labels of LZ78U should be done. Currently, we partition an LZ78U factor label into characters and former LZ78U factors, greedily chosen from left to right. The factorization of a factor label does not introduce new LZ78U factors. If it did, the factor labels created during the factorization of a factor label would themselves need a nested/recursive factorization. It could be that this improves the compression ratio.

Category: Compression
