Email Obfuscation with Stimulus

13 May 2022

8 minute read

software

Keeping email addresses away from harvester bots is an old problem. I decided to apply some new technology to it.

I’m currently working on updating my site from Rails 6 to Rails 7. It’s a rather big change, and I’m taking the opportunity to make some other changes on the site. I discovered that Google reports some problems with my page loading times (although I never thought the site was slow), so I’m trying to optimize them. I’m removing jQuery, a rather large Javascript library that I was barely using, and replacing it with Stimulus, the Javascript framework that Rails 7 includes by default. I’m also removing Bootstrap, which technically isn’t part of Rails, and replacing it with Tailwind, which should likewise be a little slimmer than Bootstrap was.

One of the bits of Javascript I’m updating is my email obfuscation code. There are several email address links on the site (for contact info and so forth) that I obscure to keep them from being scraped by bots. These bots crawl the web, scraping website HTML for email addresses to add to spam lists. It’s a problem that websites have had to deal with for a long time. There are three main approaches to solving it:

1. You can get rid of email altogether, and put up your own form if people want to write to you. I don’t want to toss out email though. I like email. I want people to be able to email me.
2. You can edit your address so it’s not a real email address. That way a simple bot using a regex won’t find it. You’ll see things like “name [at] example.com” or “You can email me by appending the number 4 to my name”. That puts an unacceptable burden on the user, in my opinion. Browsers support email links, and I want the user to be able to click on a link and have their mail client come up, if they’ve configured that.
3. You can use Javascript. This typically involves adding an inline <script> tag next to a garbled email address. A bot that only scrapes the HTML, like most bots, won’t see the email address. But a browser will execute the Javascript, which decodes the email link into something fully functional.
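To make the threat concrete, here is a rough sketch of what a naive harvester might do: run an email-shaped regex over the raw HTML. The pattern, the `harvest` function, and the sample markup are my own illustrations, not taken from any real bot.

```javascript
// Hypothetical harvester: pull anything email-shaped out of raw HTML.
// The pattern is deliberately simple; real bots may use something smarter.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function harvest(html) {
  return html.match(EMAIL_PATTERN) || [];
}

const plainLink = '<a href="mailto:hello@example.com">hello@example.com</a>';
const mangled = '<span>hello [at] example.com</span>';

console.log(harvest(plainLink)); // two matches: the href and the link text
console.log(harvest(mangled));   // no matches, since there is no "@" to find
```

Anything that keeps the literal address out of the markup defeats this kind of scan. The open question is only how much work a bot is willing to do beyond it.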

I think the third option is the best. But there’s a catch. Putting an inline <script> tag in your HTML is considered a bad practice. There are some browsers that won’t execute it due to security concerns. On a site that allows user-created content, like a comment section, it’s possible for a malicious agent to inject Javascript. If that content isn’t properly sanitized, the script can attack any user who views the page.

Also, there’s a small chance that the email harvester bot will run the Javascript code if it sees it right next to the email address. The chance is small, but this is a game of how much effort you want to spend versus how much effort the bots are going to make. I’d guess most bots only scrape the HTML and don’t bother with running scripts, but you can’t know for sure.

My old script did live inline with the HTML page, but it wasn’t directly adjacent to the email links it was modifying. With the new website, I’m moving everything to Stimulus. Also, the old system required me to manually encrypt the addresses, then paste the encrypted addresses into the HTML for the Javascript to decrypt. I’m changing that so I only handle the plaintext addresses directly, which makes it easier to use.

Two parts, encoding and decoding

Edited 05-30: I changed the encryption algorithm from what I had originally, because the first version was sometimes producing characters outside the HTML safe range.

Now I have two separate pieces of code, one that will encrypt the email addresses when it renders the HTML, and another that will decrypt them on the client side, after it’s downloaded the encrypted links. This raises a new issue.

The encryption code is written in Ruby, since it’s being done by a view helper. But the decryption is being done in Javascript, since that is being run by Stimulus. That means we can’t rely on any library functions to do our encoding, since there’s no guarantee that the Ruby and Javascript libraries will work the same way. That’s especially true for pseudo-random number generation.

We have to write our own random number generator. The good news is, we don’t need a lot of randomness. This doesn’t have to be cryptographically secure. All I want to do is scramble the characters of the address link a bit, and for that, I can use a simple linear feedback shift register. We only need to make sure that the code works the same on both the Ruby and Javascript sides.
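For a feel of how little randomness is involved, here is a standalone sketch of a 4-bit shift register with feedback from its top two bits, which is the same recurrence the generator in the code that follows uses. The names `lfsr4` and `cycle` are mine, for illustration only.

```javascript
// Standalone sketch of a 4-bit linear feedback shift register.
// Feedback bit = bit 3 XOR bit 2, shifted in at the bottom.
function lfsr4(seed) {
  let state = seed & 0x0f;
  return function next() {
    const bit = ((state >> 3) & 1) ^ ((state >> 2) & 1);
    state = ((state << 1) & 0x0f) | bit;
    return state;
  };
}

// Collect outputs until the register returns to its starting state.
function cycle(seed) {
  const next = lfsr4(seed);
  const seen = [];
  let value;
  do {
    value = next();
    seen.push(value);
  } while (value !== seed && seen.length < 32);
  return seen;
}

console.log(cycle(1).join(" "));
// 2 4 9 3 6 13 10 5 11 7 15 14 12 8 1
console.log(cycle(0).join(" "));
// 0
```

Starting from 1, the register visits all 15 nonzero states before repeating, so every shift amount from 1 to 15 shows up once per cycle. A seed of 0 never leaves 0, which would explain why the seed has to be nonzero.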

On the Ruby side, I have these functions added in application_helper.rb:

def obfuscated_email_tag(address)
  link_to '', '#', data: {
    'controller': 'obfuscate',
    'obfuscate-address-value': encode(address).force_encoding('UTF-8')
  }
end

private

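# Scrambles a plaintext address into printable ASCII.
# The first byte carries the seed ('A'..'O') so the client can replay the same sequence.
def encode(plaintext)
  seed = rand(1..15) # never 0: an all-zero register would stay stuck at 0
  random16 = xorshift4(seed)
  obfuscated = [seed + 64]
  # Shift each character forward within the 94-character window 0x20..0x7D, wrapping around.
  plaintext.codepoints.each { |c| obfuscated.append(((c - 0x20) + random16.call) % 94 + 0x20) }
  obfuscated.pack("C*")
end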

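# Despite the name, this is a 4-bit linear feedback shift register:
# the XOR of the top two bits is fed back into the bottom bit.
# Returns a lambda that yields the next state (1..15) on each call.
def xorshift4(seed)
  state = seed
  lambda {
    bit = ((state & 0x08) >> 3) ^ ((state & 0x04) >> 2)
    state = (state << 1) & 0x0f | bit
  }
end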

All this does is shift each character by a pseudo-random number of positions (e.g., “b” becomes “o” or “m”). Again, it’s not very sophisticated. It’s maybe one step up from ROT13. But it doesn’t have to be cryptographically secure, since all we’re trying to do is confuse some bots. With this helper, in any template I can write <%= obfuscated_email_tag 'hello@example.com' %> and it will come out looking something like <a data-controller="obfuscate" data-obfuscate-address-value="BblfbokfG}ke|dpe}f+lzu%dc" href="#">. That doesn’t look anything remotely like an email address.
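The per-character arithmetic is easy to try in isolation. The sketch below pulls it out into a hypothetical `shiftChar` function (my name, not part of the helper) and applies a few hand-picked offsets.

```javascript
// Shift one printable character forward by `offset` within the
// 94-character window 0x20..0x7D, wrapping around at the end.
function shiftChar(ch, offset) {
  return String.fromCharCode(((ch.charCodeAt(0) - 0x20 + offset) % 94) + 0x20);
}

console.log(shiftChar("b", 13)); // o
console.log(shiftChar("b", 11)); // m
console.log(shiftChar("x", 11)); // % (wrapped past the end of the window)
```

One thing this makes visible: the window ends at “}”, so a “~” in an address would land on the same slot as a space and would not survive the round trip. That seems unlikely to matter for ordinary addresses, but it is worth knowing about.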

Next, in app/javascript/controllers/obfuscate_controller.js, is the decryption code:

import { Controller } from "@hotwired/stimulus"

// Connects to data-controller="obfuscate"
export default class extends Controller {
  static values = { address: String }

  connect() {
    const decoded = this.decode(this.addressValue);
    this.element.setAttribute("href", "mailto:" + decoded);
    // textContent rather than innerHTML, so the decoded text is never parsed as markup
    this.element.textContent = decoded;
  }

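  // Reverses the Ruby helper: the first character carries the seed,
  // the rest are shifted back within the same 94-character window.
  decode(s) {
    const codes = Array.from(s, (c) => c.charCodeAt(0));
    const seed = codes.shift() - 64;
    const random16 = this.xorshift4(seed);
    // Adding 94 before the modulo keeps the result non-negative.
    return String.fromCharCode(...codes.map((c) => ((c - 0x20) - random16() + 94) % 94 + 0x20));
  }

  // Must produce exactly the same sequence as xorshift4 in application_helper.rb.
  xorshift4(seed) {
    let state = seed;
    return function() {
      const bit = ((state & 0x08) >> 3) ^ ((state & 0x04) >> 2);
      state = (state << 1) & 0x0f | bit;
      return state;
    }
  }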
}

This gets run by Stimulus as soon as it connects the controller to the link, and turns the link back into its original text by shifting all the characters back, using the same pseudo-random number sequence (which we have replicated in Javascript).
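A quick way to check that the two halves agree is a round trip. The sketch below ports the Ruby encode to Javascript purely for testing (the site itself encodes in Ruby) and takes the seed as an argument so every seed can be exercised. The standalone function layout is my own. Only the arithmetic is taken from the code above.

```javascript
function xorshift4(seed) {
  let state = seed;
  return () => {
    const bit = ((state & 0x08) >> 3) ^ ((state & 0x04) >> 2);
    state = ((state << 1) & 0x0f) | bit;
    return state;
  };
}

// Test-only port of the Ruby helper; takes the seed as an argument
// instead of picking one at random.
function encode(plaintext, seed) {
  const next = xorshift4(seed);
  const codes = Array.from(plaintext, (ch) =>
    ((ch.charCodeAt(0) - 0x20 + next()) % 94) + 0x20);
  return String.fromCharCode(seed + 64, ...codes);
}

// Same arithmetic as the Stimulus controller's decode method.
function decode(s) {
  const codes = Array.from(s, (ch) => ch.charCodeAt(0));
  const next = xorshift4(codes.shift() - 64);
  return String.fromCharCode(
    ...codes.map((c) => ((c - 0x20 - next() + 94) % 94) + 0x20));
}

const address = "hello@example.com";
for (let seed = 1; seed <= 15; seed++) {
  const scrambled = encode(address, seed);
  console.log(seed, JSON.stringify(scrambled), decode(scrambled) === address);
}
```

If the Ruby and Javascript generators ever drifted apart, a check like this against a string produced by the real helper would be the first place it showed up.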

This is in fact a step above simply having an inline script. An inline script, especially if it’s right next to the email address it decodes, could be run by a more sophisticated bot. I’d guess most harvester bots don’t run Javascript at all, but maybe some small percentage do, if they see an inline script. But the Stimulus code isn’t inline. The script is there on the page, of course, but Stimulus uses a MutationObserver to connect its controllers as the DOM is built. For a bot to do this, it would basically have to emulate what a browser client does. It’s just another hurdle I’m placing for the harvester bots to clear, and I think it’s one that most won’t bother with.

With these two pieces in place, I can include email links in my site easily, since all I have to do is wrap them in my helper tag. The HTML that is generated is opaque to bots that simply crawl the HTML, and they would have to run a fair amount of Javascript code to decrypt the links. For the end user, all they see is a normal email link that they can click on to send email.

There is one potential downside, and that is that this requires Javascript. If the user has Javascript turned off, then they won’t see the email link at all. I’m willing to live with that. Most people are running Javascript these days, and the people who have it disabled know that it degrades their experience on a lot of sites. I think a user who has Javascript disabled would be sophisticated enough to realize that that’s why the email link is broken, and would know to enable Javascript if they wanted to see the link.
