Title Case or Capitalize a Sentence or Word #DailyAlgorithm


This is the first "language-agnostic" #DailyAlgorithm post that I've made, and it's about both converting a sentence to "title case" and capitalizing a string (a word or a full sentence) using a simple strategy with string splits and, if need be, RegExp patterns. Even though the code is written in JavaScript, the strategy itself is language-agnostic; some languages, like Elixir, already have a built-in function to capitalize a string.

Warning: Reading about the recursive solution will take some time. If you want to skip ahead to the shortest RegExp solution, jump to the section "By Using Replace and RegExp". At the end of the article I put the simplest code I could conceive to convert a sentence to title case by splitting only on whitespace.

Capitalizing a Word or Sentence

The process of capitalizing a string doesn't require much thought. Just imagine your word and remember what capitalizing actually means: the first letter should be uppercase and the rest should be lowercase (whether you lowercase the rest really depends on how trustworthy the input is and on whether you want to take uppercase abbreviations into account too). That way, words like what, HeLlOo and daRKness become What, Hello and Darkness respectively. Most programming languages give you a helper function to uppercase or lowercase a letter or string and another helper to extract portions of a string.

function capitalize(str) {  
  if (str.length) {
    return str[0].toUpperCase() + str.substr(1).toLowerCase();
  } else {
    return '';
  }
}

And here's the ES6 version with a ternary operator just in case:

const capitalize = str => str.length?  
  str[0].toUpperCase() + str.substr(1).toLowerCase() : '';

The function first checks whether the string passed to it is empty, in which case it returns an empty string. If the string has at least one character, it uppercases the first letter and appends the rest of the string in lowercase (the lowercasing is optional, but you do need to append the rest).

Title Casing a Sentence

I title case every subtitle in my posts. As you can see above, "Title Casing a Sentence" has every word capitalized, meaning that the first letter of each word is uppercase. For some reason, most algorithms that title case a sentence capitalize every single word, including short ones like "or" and "a". That's really up to the needs of the developer, but I find it a good challenge to account for the cases where you do NOT want those short words capitalized. I've seen bad attempts at using RegExp for this, and I'm going to mention one method I came up with while struggling with this algorithm.

Some edge cases

  • What if I want to split my sentence not only by spaces but also by dashes, opening exclamation or question marks (for Spanish and other languages), forward slashes or even underscores? See what happens if I don't take this edge case into account: hi super-man, you look amazing/cool... would turn into Hi Super-man, You Look Amazing/cool... instead of the desired Hi Super-Man, You Look Amazing/Cool... Check this example as well: hola, ¿dónde Está el ¡¡baño!!? would become Hola, ¿dónde Está El ¡¡baño!!? instead of the obviously expected Hola, ¿Dónde Está El ¡¡Baño!!?.

  • I don't want to capitalize filler words like "for, to, and, or, onto, in, on, into" and so on. You can learn more about these words here.

The first edge case can be solved with two main strategies: one recursive, and the other using a simple RegExp (regular expression pattern) plus a little helper function that escapes the separators so they can be used inside a RegExp character class such as /[abc]/. I won't look into the second case here, but if you're up for the challenge, go ahead!

Tip: Use an array of filler words you want to avoid capitalizing inside the capitalize() function.
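To make the tip a bit more concrete, here is one possible sketch, not a complete solution to the second edge case. The names FILLER_WORDS, capitalizeUnlessFiller and titleCaseWithFillers are hypothetical, the entries 'a' and 'the' are my own additions to the list above, and forcing the first word to be capitalized is an assumption about the desired style.

```javascript
// Hypothetical list: extend it with whatever words your style keeps lowercase.
const FILLER_WORDS = ['for', 'to', 'and', 'or', 'onto', 'in', 'on', 'into', 'a', 'the'];

const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

// Lowercases filler words and capitalizes everything else.
// isFirst forces capitalization so the title never starts with a lowercase word.
const capitalizeUnlessFiller = (word, isFirst = false) => {
  const lower = word.toLowerCase();
  return !isFirst && FILLER_WORDS.includes(lower) ? lower : capitalize(word);
};

// Splits on single spaces only, to keep the sketch focused on the filler words.
const titleCaseWithFillers = sentence => sentence
  .split(' ')
  .map((word, index) => capitalizeUnlessFiller(word, index === 0))
  .join(' ');

console.log(titleCaseWithFillers('a guide TO cats and dogs in the city'));
// A Guide to Cats and Dogs in the City
```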

Recursively With Map, Split, and Join

The first strategy is to split the string into individual words, capitalize every word using map, and then join the new array of capitalized words back into a string using the same separator. We can easily accomplish this with a feature included in many programming languages called "string splitting": it takes a string and produces a list of the chunks that were delimited by a character, a set of characters, or a regular expression pattern (RegExp).
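For readers who haven't used string splitting before, this small sketch shows the three kinds of delimiters just described, followed by a split, map and join round trip. The sample strings are made up for illustration.

```javascript
// Split by a single character.
const byChar = 'hello big world'.split(' ');

// Split by a multi-character string.
const byString = 'a, b, c'.split(', ');

// Split by a regular expression: one or more whitespace characters.
const byPattern = 'one  two\tthree'.split(/\s+/);

// Split, transform every chunk, then join with the same separator.
const shouted = byChar.map(word => word.toUpperCase()).join(' ');

console.log(byChar);    // [ 'hello', 'big', 'world' ]
console.log(byString);  // [ 'a', 'b', 'c' ]
console.log(byPattern); // [ 'one', 'two', 'three' ]
console.log(shouted);   // HELLO BIG WORLD
```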

But with this definition of the problem we only handle one separator (the space, which is the first priority). To handle more than one separator, we can either write a nasty nesting of splits, maps and joins, or begin to think recursively. When we reach the point where we are about to capitalize each word in the array, we check whether the word (or chunk of the original string) contains any of the remaining separators (we discard the first one because we just worked with it). If it does, we repeat the process by making a recursive call to the same function, this time passing the aforementioned chunk and the remaining separators. The recursion stops once the innermost chunks (or words) don't contain any separator; in that case we just return the capitalized word or chunk.

Let's begin by creating a helper function that checks if a string or array contains at least one of the elements of another string or array (because strings are iterable).

const hasAnyOf = (str, charlist) => {  
  for (let char of str) {
    if (charlist.indexOf(char) >= 0) return true;
  }

  return false;
};

Befuddled by the ES6 syntax? Just replace the arrow function with a regular function declaration and the for of with a regular for loop.

A quick note: older versions of IE don't support Array.prototype.indexOf(), so you'll have to add a polyfill yourself.
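For reference, a pre-ES6 version of the helper could look roughly like the sketch below. One difference worth knowing: for...of walks a string by code point, while an index loop walks it by UTF-16 code unit. For the separators used in this post both behave the same.

```javascript
// ES5-style sketch of hasAnyOf: a function declaration and an index-based loop.
// Works for strings and arrays, since both support indexing and indexOf.
function hasAnyOf(str, charlist) {
  for (var i = 0; i < str.length; i++) {
    if (charlist.indexOf(str[i]) >= 0) return true;
  }

  return false;
}

console.log(hasAnyOf('super-man', '_-/'));  // true
console.log(hasAnyOf('hello', '_-/'));      // false
console.log(hasAnyOf('¡ba', ['¡', '¿']));   // true
```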

Now let's create the rest of the code so you can see how the pieces start to add up in your mind:

const mySentence = "I'm the ¡very-best ¿¿agreed, you/ya aLL?";

// ... hasAnyOf() and capitalize() go here

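const titleCase = (str, separators = ' _-¡¿/') => {
  // head is the separator to split by now; tail holds the remaining ones
  const [head, ...tail] = separators;

  return str.split(head)
    // recurse only when a chunk still contains one of the remaining separators
    .map(chunk => hasAnyOf(chunk, tail)
      ? titleCase(chunk, tail) : capitalize(chunk)
    ).join(head);
};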

console.log( titleCase(mySentence) );  
// Outputs -> I'm The ¡Very-Best ¿¿Agreed, You/Ya All?

Look closely at what's happening inside the titleCase function. It receives a string (the sentence) as its first parameter, and its second parameter is the list of separators you want to consider, written as one continuous string. Since this is ES6 syntax, you can declare a default value, which lets you omit your custom separators; in older versions of ECMAScript (JavaScript) the same default parameter behavior can be approximated with a simple optPar = optPar || <<your default value>>; inside the function (having declared optPar as a parameter). The first line of code inside the function extracts the first value of an iterable: it saves the first separator (it doesn't matter whether separators is a string or an array of strings) in a local variable called head, and the rest of the string or array is saved as an array in a variable called tail.
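The two language features just mentioned can be tried in isolation. The sketch below uses hypothetical helper names (pickSeparatorsES5, pickSeparatorsES6) only to contrast the two ways of declaring a default. Note one difference: the || fallback also kicks in for an empty string, whereas an ES6 default only applies when the argument is undefined.

```javascript
// Destructuring a string: the first character goes to head, the rest to an array.
const [head, ...tail] = ' _-';
console.log(JSON.stringify(head), tail); // " " [ '_', '-' ]

// Pre-ES6 default value: any falsy argument (undefined, '', 0...) triggers the fallback.
function pickSeparatorsES5(separators) {
  separators = separators || ' _-¡¿/';
  return separators;
}

// ES6 default parameter: only undefined triggers the default.
const pickSeparatorsES6 = (separators = ' _-¡¿/') => separators;

console.log(pickSeparatorsES5() === ' _-¡¿/');   // true
console.log(pickSeparatorsES5('') === ' _-¡¿/'); // true, '' is falsy
console.log(pickSeparatorsES6('') === '');       // true, '' is kept
```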

We then return the result of this process: split the string or sentence by the first separator (in this case, a space character) and transform each chunk (think of it as a word if it helps) according to the following rule: if the chunk contains any of the remaining separators, call the same function recursively, passing the chunk as the str argument and the remaining separators as the second argument; if the chunk doesn't need to be split again, just return the capitalized word.

The last part of the process joins the capitalized chunks with the very same character that was used to split the string (head). Here's a cool little diagram that illustrates all this madness:

.------- Main Stack Frame ---------------------------.
| str:            ¿dónde    Está    eL    ¡ba-ño!?   |
|                    v        v      v       v       |
| Split by ' ': ['¿dónde', 'Está', 'eL', '¡ba-ño!?'] |
| Mapping:          v         v      v       v       |
| Cap or recur:    (A)      Está    El      (B)      |
| Join by ' ':                   v                   |
| Return result:      "¿Dónde Está El ¡Ba-Ño!?"      |
'----------------------------------------------------'

.------- Stack Frame (A) -----------.
| After going through a range of 0  |
| to some stack frames, this is the |
| last stack frame in the (A) case  |
|-----------------------------------|
| str:              ¿ dónde         |
|                v      v           |
| Split by '¿': ['', 'dónde']       |
| Mapping:       v      v           |
| Cap or recur:       Dónde         |
| Join by '¿':          v           |
| Return result:     "¿Dónde"       |
'-----------------------------------'

.------- Stack Frame (B) -----------.
| After going through a range of 0  |
| to some stack frames, this is the |
| last stack frame in the (B) case  |
|-----------------------------------|
| str:            ¡ba  - ño!?       |
|                  v      v         |
| Split by '-': ['¡ba', 'ño!?']     |
| Mapping:         v      v         |
| Cap or recur:   (b)    Ño!?       |
| Join by '-':        v             |
| Return result:   "¡Ba-Ño!?"       |
'-----------------------------------'

.------- Stack Frame (b) -----------.
| After going through a range of 0  |
| to some stack frames, this is the |
| last stack frame in the (b) case  |
|-----------------------------------|
| str:               ¡  ba          |
|                 v     v           |
| Split by '¡':  ['', 'ba']         |
| Mapping:        v     v           |
| Cap or recur:        Ba           |
| Join by '¡':          v           |
| Return result:      "¡Ba"         |
'-----------------------------------'  

Once the tree has been resolved inside the call stack, you get the sentence back in title case. Give it a try in your browser's console and enjoy. The head-tail thingy in the first line of the function is called destructuring and was introduced in ES6; here's how you can achieve the same thing without it:

var head = separators[0];
var tail = separators.slice(1);

There is one thing I almost forgot to mention: when you split a string by a character that isn't in it, you end up with an array with only one element, the very same string. And what happens when you join that single-element array with the same character? You end up with the string you started with! This is important because it means our recursive call won't break when head is a character not present in the string being split and joined. We could make this more efficient and prevent the creation of extra stack frames, but let's leave it like that for now.
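If you are curious about that efficiency tweak, one possible way to do it (a sketch, not the only approach) is to drop the separators that don't occur in the current chunk before splitting, so no stack frame is spent on a split that changes nothing. The variable name present is my own.

```javascript
const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

const titleCase = (str, separators = ' _-¡¿/') => {
  // Keep only the separators that actually occur in this chunk.
  const present = [...separators].filter(sep => str.includes(sep));

  // Nothing left to split by: this chunk is a plain word.
  if (present.length === 0) return capitalize(str);

  const [head, ...tail] = present;

  return str.split(head)
    // Recurse only when a chunk still contains one of the remaining separators.
    .map(chunk => tail.some(sep => chunk.includes(sep))
      ? titleCase(chunk, tail)
      : capitalize(chunk))
    .join(head);
};

console.log(titleCase("I'm the ¡very-best ¿¿agreed, you/ya aLL?"));
// I'm The ¡Very-Best ¿¿Agreed, You/Ya All?
```

With the default separators, the underscore is never used as head for this sentence because it is filtered out up front.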

If you find this impractical, over-engineered or just plain wrong, the next strategy will make you feel calmer.

By Using Replace and RegExp

This is the second strategy, another approach worth experimenting with. If you're not familiar with String.prototype.replace(), I recommend reading MDN's documentation article about it. The strategy is simple but can get messy if you don't know the basics of regular expressions. What we are going to replace is every run of one or more consecutive characters that contains none of your separators. For example, with "_ -/¿¡" as your separators and the string ¿¿ dun, hello world-hehe yo/hey na_naa, the RegExp matches the following groups: ['dun,', 'hello', 'world', 'hehe', 'yo', 'hey', 'na', 'naa']. These groups are your words to be capitalized!
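You can check that grouping yourself with String.prototype.match(). In this sketch the pattern is written by hand as a literal, with the dash and the forward slash escaped, rather than built from a separators string.

```javascript
const sample = '¿¿ dun, hello world-hehe yo/hey na_naa';

// One or more consecutive characters that are NOT any of: _ space - / ¿ ¡
const words = sample.match(/[^_ \-\/¿¡]+/g);

console.log(words);
// [ 'dun,', 'hello', 'world', 'hehe', 'yo', 'hey', 'na', 'naa' ]
```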

If we call replace on our sentence with this RegExp and pass capitalize as the callback function (the second argument), every matched word gets replaced by its capitalized version. There is one little problem, though: we first need to escape the separators with a backslash, which is why I declared another helper function that adds a backslash in front of every separator. Well, in reality it's two backslashes in the source code, because inside a JavaScript string literal the backslash is itself an escape character, so a literal backslash has to be written as \\.

const mySentence = "I'm the ¡very-best ¿¿agreed, you/ya aLL?";

// ... capitalize() goes here

const escape = str => str.replace(/./g, c => `\\${c}`);

const titleCase = (sentence, seps = ' _-¡¿/') => {  
  let wordPattern = new RegExp(`[^${escape(seps)}]+`, 'g');

  return sentence.replace(wordPattern, capitalize);
};

console.log( titleCase(mySentence) );  
// Outputs -> I'm The ¡Very-Best ¿¿Agreed, You/Ya All?

The escape helper function replaces EVERY single character in the string with itself preceded by a backslash. String interpolation is nice in ES6; for those who don't use ES6, the template literal is just the equivalent of '\\' + c. The g flag means "global": without it, only the first occurrence of the pattern would be replaced.

Our new function declares a wordPattern that determines how words are selected from the string. The RegExp constructor assembles this behind the scenes: /[^\ \_\-\¡\¿\/]+/g. The caret right after the opening bracket negates the character class: it's an inverse selector that basically means "grab one or more characters that aren't any of these".
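One caveat about escaping every character: if a letter such as d or s were ever passed as a separator, prefixing it with a backslash would turn it into the class escape \d or \s and change what the pattern matches. A more conservative variant, sketched below under the hypothetical name escapeForCharClass, escapes only the characters that are special inside a character class (backslash, closing bracket, caret and dash).

```javascript
const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

// Escapes only \ ] ^ and -, the characters with special meaning inside [...].
// '$&' in the replacement string stands for the matched character itself.
const escapeForCharClass = str => str.replace(/[\\\]^-]/g, '\\$&');

const titleCase = (sentence, seps = ' _-¡¿/') => {
  const wordPattern = new RegExp(`[^${escapeForCharClass(seps)}]+`, 'g');

  return sentence.replace(wordPattern, capitalize);
};

console.log(escapeForCharClass(' _-¡¿/')); // " _\-¡¿/" (only the dash is escaped)
console.log(titleCase("I'm the ¡very-best ¿¿agreed, you/ya aLL?"));
// I'm The ¡Very-Best ¿¿Agreed, You/Ya All?
console.log(titleCase('a^b c]d', ' ^]'));
// A^B C]D
```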

Conclusion

My two cents? For me, the mapping and joining isn't that bad. Nowadays, computers have plenty of memory to handle the extra arrays, and the use cases of this algorithm aren't likely to need any significant optimization; besides, RegExp can be daunting for newcomers. On the other hand, the first strategy is complicated if you don't understand recursion, and the second one seems more elegant to me and, in my opinion, handles the extreme cases well. If you have more edge cases to consider or any other suggestion, let me know. For now, here's the code that only handles splitting and joining by whitespace:

const capitalize = str => str.length?  
  str[0].toUpperCase() + str.substr(1).toLowerCase() : '';

const titleCase = str => str.split(/\s+/).map(capitalize).join(' ');  
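A short usage sketch of this whitespace-only version follows (again with slice(1) standing in for substr(1)). Keep in mind that splitting on /\s+/ and joining with a single space normalizes the whitespace: tabs and runs of spaces come back as one space each.

```javascript
const capitalize = str => str.length
  ? str[0].toUpperCase() + str.slice(1).toLowerCase()
  : '';

const titleCase = str => str.split(/\s+/).map(capitalize).join(' ');

// Tabs and repeated spaces are collapsed into single spaces.
console.log(titleCase('the quick\tbrown   fOX')); // The Quick Brown Fox

// Leading and trailing whitespace survives as one space on each side.
console.log(JSON.stringify(titleCase('  padded  '))); // " Padded "
```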

In modern languages, things get easier or more difficult depending on the implementation of functional programming concepts and native string methods. For example, in Elixir I could do something like this:

my_sentence = "I'm a good mama's boy"

IO.puts my_sentence  
  |> String.split
  |> Stream.map(&String.capitalize/1)
  |> Enum.join(" ")

Beautiful, isn't it? If you're new to Elixir or currently learning it: the code prints the result of taking the sentence, splitting it by whitespace, mapping it so that every word is capitalized, and then joining the words with a space. If you want to see the full implementation of strategies one and two in Elixir, give me a heads up in the comment section.
