Discoverable algorithms in C++

Discoverable algorithms in C++

TL;DR

Searching  for algorithms based on the signatures of the transformations they can apply to data is invaluable.

For example, this link demonstrates a search engine used to discover all the available algorithms in Haskell for transformation given a signature "(int, [a]) -> [a]".

In C++, one library offers such a facility : FunctionalPlus.  Access its API search page and enter this in the search box:

(int, [a]) -> [a]

... and you will find many useful algorithms based on this signature (drop, repeat, stride, drop_idx, take_cyclic, partial_sort, etc, etc.).

I wish more libraries would follow this idea, since reasoning about values and their transformations is a powerful way to render the code more readable and correct at the same time.

A short C++ coding session using this technique

As an illustration, the video below shows a development session where a simple program is developed using such a search facility:

Links to solutions on Compiler Explorer:
Version with named variables: https://godbolt.org/z/YnK4bj
Version with a composed function: https://godbolt.org/z/n3Yh6s

Goal: no raw loop

Sean Parent said:

If you want to improve the code quality in your organization, replace all of your coding guidelines with one goal: no raw loop

(See C++ Seasoning video at 8'44")

In his mind, a "goal" shall not be thought as a "one must" or a "one must not",
but rather as a target you set out to follow to as much as possible.

In this context, having a search tool that helps your team to find the needed algorithms in order to avoid raw loops is invaluable IMHO. The previous video is a demonstration centered on this topic.

An old programing pearl

This  video is based on famous problem that was posed to Donald Knuth in 1986:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth proposed a very advanced solution, demonstrating his style of “Literate Programming” through the WEB system. In his solution, he introduced innovative data structures (hash tries, etc.).  His solution was however quite lengthy, and limited to text documents: it was not so easy to turn it into a solution that could be adapted to similar problems.

In his review of Knuth's solution, Doug Mcllroy acknowledges the brilliance of the solution, and the elegance of the WEB system for literate programming. However, he proposed an alternate solution in 6 lines using standard Unix utilities:

c

For those interested in the original discussion, refer to a pdf of the original article.

The unix shell solution is surprisingly elegant, as it solves the problem in simple and composable transformations. However, it can also be a little intimidating, and one might wonder "am I supposed to learn all those cryptic commands (sed, uniq, tr, etc.) and their arguments by heart?

The same is true for C++: how do we find the equivalent algorithms and transformations,that would be needed in order to reproduce such an elegant solution?

Discoverable algorithms  and transformations

Identifying an algorithm's signature

Let's examine one of those shell transformations: "uniq -c": it will replace each sequence of consecutive identical lines with one line that gives i) the number of lines and ii) the content of the line. In C++ terms, it could have the following signature:

vector < pair <size_t, string> > uniq_c(const vector<string>& v) { ... }

In more generic terms we could look for an algorithm with the following signature:

template<typename T>
vector< pair<size_t, T> > uniq_c(const vector<T>& v) { ... }

And in a more language agnostic way, we would say that the type of the signature of "uniq_c" is:

[a] -> [ (int, a) ]

i. e, given an array of type "a" (where a is a string for example), it returns an array of pair of type (number_of_occurrence, value of type a).

Searching for algorithms based on their signatures

In C++ , finding the good algorithm can sometimes be difficult. Knowing where to find the algorithm you need, and finding its name is all too often a barrier.
Jonathan Boccara has provided a helpful map, of which a picture can be found here, but apart from this, the search options are limited.
You are supposed to learn the different algorithms: this is a required part of the C++ background (and rightly so).

However, searching for algorithms based on their signature is surprisingly efficient. Let's for example think for one moment of all the algorithms that take as input an integer + an array, and return a modified array as an output.
Their signature is:

(int, [a]) -> [a]

Haskell offers a search engine, named Hoogle. With this tool, you would search for "int->[a]->[a]" (the signature is a bit different, since Haskell uses currying by default, which is beyond the scope of this post). And the results can be seen with this link.  

In C++, one library offers such a facility : FunctionalPlus. It's api search, is located at http://www.editgym.com/fplus-api-search/. Open the link and enter "(int, [a]) -> [a]" in the search box, and you will find the following list:

    (int,[a])->[a]
    
        drop : (Int, [a]) -> [a]
            Skip the first n elements of a sequence xs.
            If n > length(xs) an empty sequence is returned.
            drop(3, [0,1,2,3,4,5,6,7]) == [3,4,5,6,7]
            Also known as skip.
        
        take : (Int, [a]) -> [a]
            Return the first n elements of a sequence xs.
            In case n >= length(xs), xs is returned.
            take(3, [0,1,2,3,4,5,6,7]) == [0,1,2]
            take(10, [0,1,2]) == [0,1,2]
        
        repeat : (Int, [a]) -> [a]
            Create a sequence containing xs concatenated n times.
            repeat(3, [1, 2]) == [1, 2, 1, 2, 1, 2]
        
        stride : (Int, [a]) -> [a]
            Keeps every nth element.
            stride(3, [0,1,2,3,4,5,6,7]) == [0,3,6]
            Container stride(std::size_t step, const Container& xs)
        
        shuffle : (Int, [a]) -> [a]
            Returns a randomly shuffled version of xs.
            Example call: shuffle(std::mt19937::default_seed, xs);
            Example call: shuffle(std::random_device()(), xs);
        
        drop_idx : (Int, [a]) -> [a]
            Remove the element at a specific index from a sequence.
            drop_idx(2, [1,2,3,4,5,6,7]) == [1,2,4,5,6,7]
        
        drop_last : (Int, [a]) -> [a]
            Skip the last n elements of a sequence xs.
            If n > length(xs) an empty sequence is returned.
            drop_last(3, [0,1,2,3,4,5,6,7]) == [0,1,2,3,4]
        
        take_last : (Int, [a]) -> [a]
            Return the last n elements of a sequence xs.
            In case n >= length(xs), xs is returned.
            take_last(3, [0,1,2,3,4,5,6,7]) == [5,6,7]
            take_last(10, [0,1,2]) == [0,1,2]
        
        drop_exact : (Int, [a]) -> [a]
            Skip exactly the first n elements of a sequence xs.
            Unsafe! Crashes when xs is too short.
            drop_exact(3, [0,1,2,3,4,5,6,7]) == [3,4,5,6,7]
            drop_exact(10, [0,1,2,3,4,5,6,7]) == crash
        
        take_exact : (Int, [a]) -> [a]
            Return exactly the first n elements of a sequence xs.
            Unsafe! Crashes then sequence is too short.
            take_exact(3, [0,1,2,3,4,5,6,7]) == [0,1,2]
            take_exact(10, [0,1,2]) == crash
        
        take_cyclic : (Int, [a]) -> [a]
            Takes n elements from a sequence considering it as cyclic.
            take_cyclic(5, [0,1,2,3]) == [0,1,2,3,0]
            take_cyclic(7, [0,1,2,3]) == [0,1,2,3,0,1,2]
            take_cyclic(7, [0,1]) == [0,1,0,1,0,1,0]
            take_cyclic(2, [0,1,2,3]) == [0,1]
            take_cyclic(3, [0]) == [0,0,0]
            take_cyclic(3, []) == crash!
            Also known as take_wrap.
            xs must be non-empty.
        
        partial_sort : (Int, [a]) -> [a]
            Partially sort a sequence in ascending order using std::less.
            Returns only the sorted segment.
        
        replicate_elems : (Int, [a]) -> [a]
            Replicate every element n times, concatenate the result.
            replicate_elems(3, [1,2]) == [1, 1, 1, 2, 2, 2]
            Container replicate_elems(std::size_t n, const Container& xs)

A fast and elegant to search for algorithms and transformation!

Re-discovering RLE (run-length-encode)

Let's go back to "uniq -c".  If we go back to Fplus's search engine, and search for
[a] -> [ (int, a) ]
we discover the following functions.

    [a] -> [(Int, a)]
    
    enumerate : [a] -> [(Int, a)]
        Attach its index to every element of a sequence.
        enumerate([6,4,7,6]) == [(0, 6), (1, 4), (2, 7), (3, 6)]
    
    run_length_encode : [a] -> [(Int, a)]
        run_length_encode([1,2,2,2,2,3,3,2)) == [(1,1),(4,2),(2,3),(1,2)]


And, that's it!
Run length encoding is exactly what "uniq -c" is doing. We did not know the name of the algorithm, but the search engine geared us in the good direction.

I wish other libraries would follow this idea of being "searchable by signature".

Composing algorithms

Composing algorithms is a powerful way to develop new features, and to be more certain of the correctness of your program.

For comparison, the shell script:

tr -cs A-Za-z '\n' |  # 1 word per line
tr A-Z a-z |          # lowercase
sort |                # sort words (with possible repetitions)
uniq -c |             # group identical words and keep and count per group
sort -rn |            # numerical sort, inverted
sed $1q               # Display the n first lines

becomes, when translated to C++, using similar transformations:

std::function<string(string)> make_word_stats_processor(size_t nb_results)
{
  return fplus::compose(
      fplus::fwd::split_words(false)
    , fplus::fwd::transform(fplus::to_lower_case<string>) 
    , fplus::sort<std::vector<std::string>>
    , fplus::run_length_encode<std::vector<string>>
    , fplus::fwd::sort_by(greater<pair<size_t, string>>())
    , fplus::fwd::transform(show_word_count)
    , fplus::fwd::take(nb_results)
    , fplus::fwd::join(std::string("\n"))
  );
}

I'll be brutally honest about it: when writing such code, you will inevitably make type errors which the C++ compiler will "reward" with cryptic template error messages.  The first attempts might be painful.
However, the final code is very readable IMHO, and most of the time, a code that compiles will be correct at the first run!
Moreover, parallelization is then almost free (see fplus::transform_parallely for example).

Using pure higher order functions: an example with "interact"

fplus::interact enables to transform a string->string function into a complete stdin/stdout program. It is thus a higher order function that takes a function as parameter, and returns another  function.

Here is its signature:

interact : (String -> String) -> Io ()

Takes a function F of type (String -> String)
and returns a function that
reads the entire input from standard input,
passes it through the given function,
and writes the result to standard output.

The Io() at the end of the signature signals that the returned function will be able, when called, to perform side effects (e.g read from stdin and write to stdout).

However, fplus::interact is in itself a pure function. It will not perform the side effect by itself, since you have to run the returned function in order to perform the side effect. This way, this function remains pure and composable.

About Pascal Thomet

I am Pascal Thomet (pthomet -at- gmail -dot- com). I live in Paris, where I was the co-founder of IVS (http://www.activisu.com). I always had an interest in transmitting knowledge, hence this blog.

Comments