Concatenated Words

2024-04-28

Problem: You are given a list of words. From the list, find all the words that can be formed by concatenating other words in the list.

Constraints:

The number of words is between 1 and 10^3.
The length of a word is between 1 and 50.
All words are unique.
All words are in lowercase.

Table of contents

Initial thought process

First, we can recognize that some sort of depth first search on a word can reveal whether the word can be partitioned into other words in a list. For example, let’s say that we have a list of words ["high", "way", "highway"]. The word ‘highway’ can be partitioned into two other words in the list, ‘high’, and ‘way’. Consider a composite word such as ‘highway’ as a root node in a tree, and its components ‘high’ and ‘way’ as child nodes. Then, it is easier to see why a depth-first search will work. If we can fully break down a word into such a tree, the word must be a composite word. On the other hand, if we are unable to form a tree out of a word, the word must not a a composite word. Therefore, our initial strategy is to perform a depth-first search on each word to try to break it down into a tree of other words in the list.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
class Solution {
    public List<String> findConcatenatedWords(String[] words) {
        List<String> answer = new ArrayList<>();
        for (String word : words) {
            // DFS on each word to break it down
        }

        return answer;
    }
}

While doing a DFS, we will be walking the word character by character, trying to form a word that is already in the list of the words. For example, to check whether ‘highway’ is a composite word, we will be checking ‘h’, ‘hi’, ‘hih’ to see whether they are words in the list. Finally, we will see that ‘high’ is a word in the list. At that point, we will recursively do a DFS on an input ‘way’, solving a sub-problem. All in all, we need a way to efficiently check whether a string is in the list of the words. Therefore, we build a set from the list of words.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
class Solution {
    public List<String> findConcatenatedWords(String[] words) {
        // Create dictionary
        Set<String> wordSet = new HashSet<>();
        for (String word : words) {
            wordSet.add(word);
        }

        List<String> answer = new ArrayList<>();
        for (String word : words) {
            // DFS on each word to break it down
            if (canPartition(/* TODO */)) {
                answer.add(word);
            }
        }

        return answer;
    }
}

Then, we finally implement the canPartition method to break down the word into a tree of other words using a DFS.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
private boolean canPartition(String word, Set<String> wordSet, int startIndex) {
    int wordLength = word.length();

    if (startIndex == wordLength) {
        return true;
    }

    for (int i = startIndex; i < wordLength; i++) {
        String part = word.substring(startIndex, i + 1);

        // if the word is in the list of the words, recurse.
        if (wordSet.contains(part) && canPartition(word, wordSet, i + 1)) {
            return true;
        }
    }

    return false;
}

Only, this implementation returns true for every word in the list because a word is composed of itself. There are two ways to overcome this problem. First way is to keep track of the depth of the recursion. If the depth of the recursion is greater than 1, then the word must have more than one components.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
private boolean canPartition(String word, Set<String> wordSet, int startIndex, int depth) {
    int wordLength = word.length();

    if (startIndex == wordLength) {
        return depth > 1;
    }

    for (int i = startIndex; i < wordLength; i++) {
        String part = word.substring(startIndex, i + 1);
        if (wordSet.contains(part) && canPartition(word, wordSet, i + 1, depth + 1)) {
            return true;
        }
    }

    return false;
}

Another way is to stop the recursion if a part is equal to the original word. If this happens, we know that the word cannot be partitioned, because we have reached the end of the word without being able to break the word into other parts. We choose this approach, rather than keeping track of the depth. It is a good idea to simplify by having fewer states while solving sub-problems, as we will see when we try to form an iterative solution later.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
private boolean canPartition(String word, Set<String> wordSet, int startIndex) {
    int wordLength = word.length();

    if (startIndex == wordLength) {
        return depth > 1;
    }

    for (int i = startIndex; i < wordLength; i++) {
        String part = word.substring(startIndex, i + 1);
        if (part.equals(word)) {
            return false;
        }

        if (wordSet.contains(part) && canPartition(word, wordSet, i + 1)) {
            return true;
        }
    }

    return false;
}

So far, our initial solution is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
class Solution {
    public List<String> findConcatenatedWords(String[] words) {
        // Create dictionary
        Set<String> wordSet = new HashSet<>();
        for (String word : words) {
            wordSet.add(word);
        }

        List<String> answer = new ArrayList<>();
        for (String word : words) {
            // DFS on each word to break it down
            if (canPartition(word, wordSet, 0)) {
                answer.add(word);
            }
        }

        return answer;
    }

    private boolean canPartition(String word, Set<String> wordSet, int startIndex) {
        int wordLength = word.length();

        if (startIndex == wordLength) {
            return depth > 1;
        }

        for (int i = startIndex; i < wordLength; i++) {
            String part = word.substring(startIndex, i + 1);
            if (part.equals(word)) {
                return false;
            }

            if (wordSet.contains(part) && canPartition(word, wordSet, i + 1)) {
                return true;
            }
        }

        return false;
    }
}

Complexity

Let’s say N is the number of words in the list, and K is the length of the longest word in the list.

Time complexity is O(N * K^3). The reason is as follows: First of all, we are looping through all the words in the list (N). For each word, the maximum number of iterations over the characters is (K-1) + (K-2) + … + 1 ~= O(K^2). Let’s say we are iterating through a word ‘abcdefg’. Suppose that the first iteration in the outer loop causes the inner loop to go through each letters in ‘bcdefg’. And the second iteration of the outer loop results in the inner loop over the each letters of ‘cdefg’, and so on. Each iteration creates a substring which takes a time proportional to the length of the string (K). All in all, the time complexity becomes O(N * K^3).

Space complexity is O(N * K). The word set takes a space proportional to N * K. In addition, the call stack could be as deep as K, if each letter of the word triggers a recursion.

Optimizing with memoiziation

When solving a problem using a divide-and-conquer approach, it is a good idea to check whether we can avoid solving a same sub-problem multiple times. One common way is to memoize solutions to sub-problems.

In our initial DFS solution, we end up repeatedly solving a sub-problem at a particular startIndex even though we might have solved the same problem before. For example, let’s consider an input ["x", "xx", "xxx", "xxxzzzzzzz"]. The DFS will discover that ‘xxxxxzzzzzzz’ cannot be partitioned because ‘zzzzzzz’ is not in the list of the words. Nevertheless, we will keep end up solving the sub-problem ‘zzzzzzz’. To avoid doing so, we can memoize the sub-problem solution in an array.

In the following approach, we change the canPartition method to call a separate dfs method, passing a boolean array for memoization. When you are optimizing a recursive approach using memoization, it is a good idea to separate the actual recursive logic into a helper function, while keeping track of memoized results in the main function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
private boolean canPartition(String word, Set<String> wordSet) {
    Boolean[] memo = new Boolean[word.length()];

    return dfs(word, wordSet, 0, memo);
}

private boolean dfs(String word, Set<String> wordSet, int startIndex, Boolean[] memo) {
    if (startIndex == word.length()) {
        return true;
    }
    if (memo[startIndex] != null) {
        return memo[startIndex];
    }

    for (int i = startIndex; i < word.length(); i++) {
        String part = word.substring(startIndex, i + 1);
        if (part.equals(word)) {
            return false;
        }
        if (wordSet.contains(part) && dfs(word, wordSet, i + 1, memo)) {
            return true;
        }
    }

    memo[startIndex] = false;
    return false;
}

While the complexity remains unchanged from the initial solution, the memoization can speed up the run time for some inputs.

Iterative approach

From the solution with memoization, we can deduce a the following rules for the sub-problems. If we define n as the length of the word, and partition(i) as a sub-problem of whether word.substring(i, n) is a composite word, then:

partition(n) = true
partition(i) = true if word[i..j] is in the list and partition(j+1) is true, for some j such that i <= j < n.
- if i == 0, then j must not be n - 1. In other words, the part must not be the word itself.

Let’s define an array dp to keep track of the sub-problem solutions. Since we know partition(n) is true, we can walk backwards from the last character, and populate dp. At the end, dp[0] will contain the solution for the entire problem.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
private boolean canParititon(String word, Set<String> wordSet) {
    int n = word.length();
    boolean dp[] = new boolean[n+1];
    dp[n] = true;

    for (int i = n - 1; i >= 0; i--) {
        for (int j = i; j < n; j++) {
            String part = word.substring(i, j+1);

            if (wordSet.contains(part) && dp[j+1]) {
                if (i != 0 || j != n - 1) {
                    dp[i] = true;
                    break;
                }
            }
        }
    }

    return dp[0];
}

The time and space complexity is the same as that of the recursive DFS approach. Specifically, if we define N as the number of words, and K as the length of the longest word, the time complexity is O(N * K^3) because we loop through every word, and for every word, canPartition loops ~K^2 times while performing substring operation which takes ~K time. The space complexity is dominated by the word set which takes up O(N * K) space.

Validating the input

In both real life and an interview, it is important to clarify the constraints of the problem, and validate the input. Here, we implement a validateInput method and call it at the beginning to throw exception if the input is not valid. Note that validateInput does not change the complexity of the solution.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class Solution {
    public List<String> findConcatenatedWords(String[] words) {
        validateInput(words);

        // ...
    }

    private void validateInput(String[] words) {
        if (words.length < 1 || words.length > 10e3) {
            throw new IllegalArgumentException("words length is out of bound.");
        }

        Set<String> seen = new HashSet<>();
        for (String word : words) {
            if (word.length() < 1 || word.length() > 30) {
                throw new IllegalArgumentException(String.format("word length is out of bound: '%s'", word));
            }

            if (seen.contains(word)) {
                throw new IllegalArgumentException(String.format("duplicate word: '%s", word));
            }

            if (!word.equals(word.toLowerCase())) {
                throw new IllegalArgumentException(String.format("word is not lowercase: '%s", word));
            }

            seen.add(word);
        }
    }
}

Follow-ups

One possible follow-up question is to extend the solution to find all composite words formed by concatenating 3 or more words. We could achieve this by keeping track of the depth in the recursive solution, as we saw in the initial thought process section. We can check that the depth is greater than 2 in the base case. Alternatively, we might choose to iterate through the memoization array at the end to see whether we have more than 3 true values.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
private boolean canParititon(String word, Set<String> wordSet) {
    int n = word.length();
    boolean dp[] = new boolean[n+1];
    dp[n] = true;

    for (int i = n - 1; i >= 0; i--) {
        for (int j = i; j < n; j++) {
            String part = word.substring(i, j+1);

            if (wordSet.contains(part) && dp[j+1]) {
                if (i != 0 || j != n - 1) {
                    dp[i] = true;
                    break;
                }
            }
        }
    }

    int count = 0;
    for (int i = 0; i < n; i++) {
        if (dp[i]) {
            count++;
        }
    }

    return count >= 3;
}

Lessons learned

When you have a recursive solution, check whether the result of a sub-problem can be memoized.
To optimize a recursive solution using memoization, it might be a good idea to move the recursive logic into a helper function. That way, the main function can initialize a data structure for memoization and call the helper function with it.
Minimize the number of states that a recursive solution needs to track. That way, it is simpler to memoize the sub-problem solutions.
We can derive the rules for sub-problems by observing the recursive approach, and use the rules to come up with an iterative approach.