Authl 0.5.2 was just released, bringing support for PKCE. And that surfaced another bug in Authorio.
Authorio has long had support for PKCE, or so I had thought. The test suite has cases that test for correct challenges and verifiers, and also that it rejects a verifier that doesn’t properly match. When the Authl developer wanted me to test their new support for profiles, I hit a small problem because their server didn’t support PKCE yet (a so-called legacy client). Authorio was supposed to handle that, but a separate bug had made it require PKCE. That was a simple fix though.
Yesterday fluffy added PKCE to Authl and asked me to test it. Their first iteration had a bug and wasn’t generating code_challenges correctly, and a separate IndieAuth implementation rejected it. But Authorio had a successful login! It was approving client logins even when the code_challenge was mismatched. Why?
I started with some basic instrumentation and soon discovered that the session was blank in the code verification controller. When the initial code_challenge is sent, Authorio stashes that in the session, and pulls it back out to compare to the code_verifier for the second authentication phase. If the code_challenge is blank, Authorio assumes it’s dealing with a legacy client that doesn’t use PKCE and approves the request anyway. So that’s why Authorio was approving requests with a mismatched code_verifier – by the time it tried to compare it, the code_challenge it was comparing it to no longer existed.
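Here is a minimal sketch of that failure mode. It is not Authorio's actual code: the `pkce_ok?` method, the `:code_challenge` session key, and the helper names are all assumptions made for illustration, and it assumes the S256 challenge method.

```ruby
require "digest"

# Base64url without padding, built from core Array#pack so no extra gem is needed.
def base64url(bytes)
  [bytes].pack("m0").tr("+/", "-_").delete("=")
end

# S256 transform from the PKCE spec: BASE64URL(SHA256(code_verifier)).
def s256(code_verifier)
  base64url(Digest::SHA256.digest(code_verifier))
end

# Hypothetical check: compares the verifier against a challenge kept in the session.
def pkce_ok?(session, code_verifier)
  challenge = session[:code_challenge]
  # Legacy-client fallback: no stored challenge means PKCE is skipped entirely.
  return true if challenge.nil? || challenge.empty?

  challenge == s256(code_verifier)
end

browser_session = { code_challenge: s256("correct-verifier") }
empty_session = {} # what the verification controller actually saw

pkce_ok?(browser_session, "wrong-verifier") # => false, the mismatch is caught
pkce_ok?(empty_session, "wrong-verifier")   # => true, the mismatch is never checked
```

With an empty session the comparison is never reached, so every verifier looks like it came from a legacy client.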
Why was the session missing? This sent me down the long rabbit hole of Rails’ forgery protection. To protect against Cross Site Request Forgery, Rails uses a somewhat complex system of tokens which are stashed on each page and in forms that are posted, and then compares the tokens on incoming requests. If there’s a mismatch there Rails assumes there’s some monkey business going on, and takes some preventative steps, one of which is to nullify the session.
I knew all of this beforehand. I’ve had to deal with CSRF before, and I had turned it off for the requests that work as an API, the requests where a client posts data to the endpoint. Those requests are secured by PKCE so there’s no need for CSRF protection, and there’s no way to get the CSRF prevention token over to the client anyways. But why was Rails nullifying my session for those endpoints when I had told it not to?
After tracing through Rails code for a while I realized that Rails wasn’t nullifying my session. The session was empty because the request was coming from a different host. The code_challenge is initially set in the session on the user’s machine, when they authenticate. Then the client, a separate host, posts the code_verifier. Two hosts, two sessions. I can’t use the session to store the code_challenge; that will never work. The bug actually had nothing to do with CSRF protection at all.
Ok, but I have tests! Why were my integration tests working? Well, the integration tests simulate connections coming from the user and the client, but they are both actually coming from the same host, just my laptop running the tests. So for that case, the session actually worked to pass the code_challenge between requests!
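A toy model makes the difference visible. It assumes a cookie-backed session, where the server only sees whatever the caller's cookie jar sends back. The `ToyServer` class and its method names are invented for this sketch and have nothing to do with Rails internals.

```ruby
# Toy model of a cookie-backed session: the server keeps no state of its own,
# so each caller's "jar" (a plain Hash here) is the only place data can live.
class ToyServer
  # First phase: the user's browser authenticates and the challenge is stashed.
  def authorize(jar, code_challenge)
    jar[:code_challenge] = code_challenge
  end

  # Second phase: whatever is in the caller's jar is all the server can see.
  def challenge_seen_at_verification(jar)
    jar[:code_challenge]
  end
end

server = ToyServer.new
browser_jar = {} # cookies held by the user's browser
client_jar = {}  # cookies held by the client's server

server.authorize(browser_jar, "abc123")

same_jar = server.challenge_seen_at_verification(browser_jar)  # "abc123": what my tests exercised
other_jar = server.challenge_seen_at_verification(client_jar)  # nil: what happens in production
```

Both phases going through one jar is exactly the situation the old integration tests were in.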
I had to change the test to make it simulate two separate hosts. And this is something that Rails has changed several times over the years. Whenever that happens, you can be sure that Stack Overflow is full of wrong, outdated answers. The correct answer (current as of September 2021, Rails 6.1 for anyone reading this) is to create a new session for requests that you want to simulate coming from a different host, like so:
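A rough sketch of the idea, assuming Rails' `open_session` helper from `ActionDispatch::IntegrationTest`, which returns a fresh integration session with its own cookies. The `TwoHostFlow` module, the method name, and the path and parameter arguments are placeholders for illustration, not Authorio's real test code or routes.

```ruby
# Helper meant to be included in an ActionDispatch::IntegrationTest.
# It relies only on open_session plus the get/post/response methods
# that each integration session provides.
module TwoHostFlow
  # Runs both phases of the flow, each through its own integration session,
  # so the second request cannot see cookies set by the first.
  def two_host_flow(auth_path:, token_path:, auth_params:, token_params:)
    user = open_session   # stands in for the user's browser
    client = open_session # stands in for the client's server

    user.get(auth_path, params: auth_params)
    client.post(token_path, params: token_params)

    # The response to the client's request is the one the test should assert on.
    client.response
  end
end
```

Inside a test this would be called with the app's own authorization and verification paths, and the returned response checked for a rejection when the code_verifier doesn't match.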
Now that my tests were properly failing, I needed a way to pass code_challenges between the two endpoint calls for the authentication flow. Since I can’t use the session, I’ll just add it as an attribute on the auth request and save it in the database.
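In plain Ruby the idea looks roughly like this. The real thing is a database-backed model, so the `AuthRequest` struct, its fields, and the in-memory `AuthRequestStore` are stand-ins I made up for the sketch. It also assumes the client hands back an authorization code that can be used as the lookup key.

```ruby
# Stand-in for a database record describing one pending authorization.
AuthRequest = Struct.new(:code, :client_id, :code_challenge, keyword_init: true)

# Stand-in for the database table, keyed by the authorization code.
class AuthRequestStore
  def initialize
    @by_code = {}
  end

  def save(request)
    @by_code[request.code] = request
  end

  def find_by_code(code)
    @by_code[code]
  end
end

store = AuthRequestStore.new

# Phase one (user's browser): persist the challenge alongside the request.
store.save(AuthRequest.new(code: "code-1", client_id: "https://client.example/", code_challenge: "abc123"))

# Phase two (client's server): the lookup depends on the code, not on any session.
found = store.find_by_code("code-1")
found.code_challenge # => "abc123"
```

Because the lookup key travels with the request itself, it no longer matters which host makes the second call.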
I had actually done that originally, in an early version of the code. But after reviewing the code for Acquiescence I decided to move as much of my state as possible to the session, since that’s what it does and it seemed a little cleaner. But Acquiescence doesn’t implement PKCE. D’oh!
After moving the code_challenge to the data model, the code verification was actually running, instead of just passing everything because it thought they were all legacy clients. And now the code failed to verify all the sites that it used to be able to log into. This was because there was a bug in my verification code that had never been revealed, because that code had never really been exercised. Well, it had been run: I had a test that checked valid and invalid code verification. But of course I needed seed data to test that with, and I had generated those values by running some test data through the code to see what it expected, so the test could only confirm that the code agreed with itself.
That was the last bug. (I was taking a Base64URL encoding of the hex digest of the SHA256, when I needed to do it on the raw byte values.) Now the PKCE code verification is running properly.
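To make the difference concrete, here is a small comparison using the example verifier and challenge from Appendix B of RFC 7636. The `base64url` helper is my own shorthand for this sketch, not a name from Authorio.

```ruby
require "digest"

# Base64url without padding, built from core Array#pack.
def base64url(bytes)
  [bytes].pack("m0").tr("+/", "-_").delete("=")
end

# Example code_verifier from RFC 7636, Appendix B.
verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"

# Buggy version: encodes the 64-character hex string of the digest.
wrong = base64url(Digest::SHA256.hexdigest(verifier))

# Correct version: encodes the 32 raw digest bytes.
right = base64url(Digest::SHA256.digest(verifier))

right        # => "E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM"
right.length # => 43
wrong.length # => 86
```

Seed data generated with the buggy version matches the buggy version perfectly, which is why a known-good vector from the spec makes a better fixture.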