W12: update after long

Long time no write! This is a blog to publish updates on my projects, so that's exactly what I'll do here.

Regex engine

Right now I'm writing a fast regular expression search engine in Rust. In case you don't know what a regular expression is: it lets you specify rules about what some text should look like (think “starts with H” and “has 5 letters”), and then you can check whether a given text follows them (as the words “Hello” and “Hopes” do), or search a larger text for strings that fulfil them. In this case, the regular expression for these rules would be H[A-Za-z]{4} (“H” followed by 4 letters).
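To make those two uses concrete, here is a minimal hand-rolled sketch of what H[A-Za-z]{4} checks, written with only the standard library. The function names are made up for illustration, and this is not how my crate (or any regex engine) works internally; it only spells out the rules.

```rust
/// Returns true if `word` as a whole follows H[A-Za-z]{4}:
/// an 'H' followed by exactly four ASCII letters.
fn matches_h_word(word: &str) -> bool {
    let bytes = word.as_bytes();
    bytes.len() == 5
        && bytes[0] == b'H'
        && bytes[1..].iter().all(|b| b.is_ascii_alphabetic())
}

/// Returns the byte offsets inside `text` at which a match of
/// H[A-Za-z]{4} starts (matches are allowed to overlap).
fn find_h_words(text: &str) -> Vec<usize> {
    text.as_bytes()
        .windows(5) // every run of 5 consecutive bytes
        .enumerate()
        .filter(|(_, w)| w[0] == b'H' && w[1..].iter().all(|b| b.is_ascii_alphabetic()))
        .map(|(i, _)| i)
        .collect()
}
```

The first function is the “does this text follow the rules” question; the second is the “search within a larger text” question. A regex engine answers both for arbitrary patterns instead of one hard-coded rule.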

There are quite a few software libraries that implement regular expressions in Rust. To name a few:

I ran a benchmark that consists of generating random sequences of A, C, G and T of various lengths and then searching for all (possibly overlapping) matches of the regex ATG([^T]..|T([CT].|G[^A]|A[CT]))*T(A[AG]|GA) (which recognizes possible forward open reading frames). The results are these:

If we replace that regex with ATG(...)*?T(A[AG]|GA) (which matches the same regions but uses the lazy *? quantifier for a more concise representation), then we get:

regex and resharp also had the highest compilation overhead, in the hundreds of microseconds, while both regexr and my crate could compile the regex in about 20 microseconds.
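For readers who prefer code to regex syntax, the search task in the benchmark can also be sketched by hand: from every ATG, walk codon by codon until the first in-frame stop codon (TAA, TAG or TGA). This is a plain standard-library sketch with hypothetical function names, assuming the input contains only A, C, G and T; it is not the code I benchmarked, and none of the crates work this way.

```rust
/// True if `codon` is one of the stop codons TAA, TAG or TGA.
fn is_stop(codon: &[u8]) -> bool {
    matches!(codon, b"TAA" | b"TAG" | b"TGA")
}

/// Returns (start, end) byte ranges, end exclusive, of every forward
/// open reading frame: an ATG, then zero or more in-frame non-stop
/// codons, then the first in-frame stop codon.
/// Ranges that start at different ATGs may overlap.
fn find_orfs(seq: &[u8]) -> Vec<(usize, usize)> {
    let mut orfs = Vec::new();
    // The shortest possible match is ATG plus a stop codon (6 bytes).
    if seq.len() < 6 {
        return orfs;
    }
    for start in 0..=seq.len() - 6 {
        if !seq[start..].starts_with(b"ATG") {
            continue;
        }
        // Step through the codons that follow the start codon.
        let mut pos = start + 3;
        while pos + 3 <= seq.len() {
            if is_stop(&seq[pos..pos + 3]) {
                orfs.push((start, pos + 3));
                break;
            }
            pos += 3;
        }
    }
    orfs
}
```

Each ATG yields at most one region, because the scan stops at the first in-frame stop codon. That is also why the greedy and the lazy versions of the regex find the same regions.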

I would like to work a bit more on this crate. Specifically:

Career

I'm working as a Resident Physician in Psychiatry. I'm also trying to get a Master's degree in Biomedical Engineering. Pretty exciting stuff is happening on that front.