A challenge: remove tags from a string.

Recently I have implemented a small piece of functionality, which is suitable to be a small challenge.

Description

Given a string with opening and closing tags (e.g., <mark></mark>), return a string without tags and indices of opening and closing tags. For example, given a string: "we eat <mark>healthy</mark> and <mark>tasty</mark> food." We expect to receive a string: "we eat healthy and tasty food.", and an array with pairs of indices: [[7, 14], [19, 24]]. There are two pairs of indices, because there were two pairs of <mark></mark> tags.

Ruby

First, I tried to solve this problem imperative way, which did not work very well. Solving the problem with recursion was easier:

  def remove_tag(str)
    str.sub(/<(\/?)mark>/, '')
  end

  def compute_text_with_indices(str, indices = [])
    if /<(\/?)mark>/.match(str) == nil
      {text: str, indices: indices.each_slice(2).to_a}
    else
      indices << /<(\/?)mark>/.match(str).begin(0)
      compute_text_with_indices(remove_tag(str), indices)
    end
  end

The trick here is to compute position of each subsequent tag after the previous one has been removed. It is needed to account for the shifting of the marked part of the string after removal of tags. Ruby provides a convenient method to retrieve index of the _n_th match - begin(n) (where n is index of an element in the match array). There is also end(n) method.

Elixir (Ruby-ish)

Then I tried to solve the same problem with Elixir. My first attempt was to use only pattern matching and recursion, but that didn’t look as pretty as expected (next solution), so I went on to try to repeat my ruby solution.

defmodule MyString do
  def remove_tags(str, indices \\ []) do
    unless String.contains?(str, ["<mark>", "</mark>"]) do
      %{ text: str, indices: Enum.chunk(Enum.reverse(indices), 2) }
    else
      remove_tags(remove_tag(str),
                  [ get_match_index(str, ~r/<(\/?)mark>/) ] ++ indices
                 )
    end
  end

  def remove_tag(str) do
    Regex.replace(~r/<(\/?)mark>/,str , "", global: false)
  end

  def get_match_index(str, regex) do
    Regex.run(regex, str, return: :index, capture: :first)
    |> List.first
    |> Tuple.to_list
    |> List.first
  end
end

With Elixir I had to write a small function to extract beginning of a match - get_match_index/2. It was unclear for me what exactly Regex.run/2 returns in [{integer, integer}]. At first I assumed it was beginning and end of a match. Later, when I saw [{12, 7}] it became clear that it probably returns beginning of the match and it’s length.

Elixir

Since Elixir is a functional language I expected to write code without any if\unless statements easily. Even though I managed to do this, I have a feeling that it does look beautiful:

defmodule MyString do
  def remove_tags(str), do: remove_tags(str, [], has_match?(str) )

  def remove_tags(str, acc, false), do: %{ text: str, indices: Enum.chunk(Enum.reverse(acc), 2) }
  def remove_tags(str, acc, true) do
    remove_tags(remove_tag(str),
                [ get_match_index(str, ~r/<(\/?)mark>/) ] ++ acc,
                has_match?(remove_tag(str))
    )
  end

  def remove_tag(""),  do: []
  def remove_tag(str), do: Regex.replace(~r/<(\/?)mark>/,str , "", global: false)

  def has_match?(str) do
    String.contains?(str, ["<mark>", "</mark>"])
  end

  def get_match_index(str, regex) do
    Regex.run(regex, str, return: :index, capture: :first)
    |> List.first
    |> Tuple.to_list
    |> List.first
  end
end

In this version I use pattern matching instead of explicit conditional to decide whether the program should iterate more or terminate.

Conclusion

Even such a small problem as removing tags from a string can take some time to get right. We see that Elixir can be written very similar to Ruby and can be written very differently.