Patterns Look Different When a Machine Has to Actually Check Them

Published at March 16, 2026 ... views


One thing that clicked for me recently is how different it feels to describe a pattern versus checking whether something matches it.

Describing is easy. You can say "any number of a's followed by any number of b's" and a human immediately gets it. But if you had to build a machine — a very simple machine, with no scratch paper, no memory beyond where it currently is — could it still check that pattern?

It turns out, yes. And the machine that does it is called a deterministic finite automaton, or DFA. It's the simplest computation model that exists, and yet it can recognize a surprisingly wide range of patterns.

In the previous post, I covered the building blocks: alphabets, strings, languages, and the three operations (union, concatenation, Kleene star). This post picks up right where that left off — first making regular expressions formal, then introducing the machines that actually do the checking.

Regular expressions are recursive — you build them from the bottom up

The formal definition of a regular expression is elegant because it's recursive. You start from three atomic pieces, and you combine them using three operations. That's it. Everything else is built from those.

Loading diagram...

The rules:

  1. Any single character from the alphabet is a regular expression
  2. ε (the empty string) is a regular expression
  3. ∅ (the empty set) is a regular expression
  4. If R₁ and R₂ are regular expressions, then R₁ ∪ R₂ is too
  5. If R₁ and R₂ are regular expressions, then R₁R₂ (concatenation) is too
  6. If R is a regular expression, then R* is too

That's the complete definition. Every regular expression you'll ever write can be traced back to these six rules applied some number of times.

Think of it like cooking. You start with raw ingredients (individual characters, ε, ∅). Then you combine them using three techniques (union, concatenation, Kleene star) — and you can keep combining the results with more of the same techniques, layering complexity until you have something sophisticated. A Beurre blanc is just butter, shallots, and wine — combined in a specific recursive sequence of reductions.

A student combines symbolic ingredients into a glowing regular expression recipe

Shorthand you'll sometimes see

There are two common shortcuts that aren't part of the formal definition but are built from it:

  • R⁺ = RR* (one or more repetitions of R)
  • Rᵏ = R concatenated with itself k times

These are just conveniences. They don't add any power — anything you can write with R⁺ or Rᵏ, you can write with the six rules above.

The L(R) notation keeps descriptions and languages separate

This is a small but important point. A regular expression R is a string of symbols — it's a description. The language it describes is written L(R).

Loading diagram...

So when you write 0*1, that's a pattern. When you write L(0*1), that's the actual set of strings matching that pattern.

Some textbooks blur this distinction and write things like 0*1 = #123;1, 01, 001, ...#125;. That's technically sloppy because the left side is a description and the right side is a set — two different types of objects. But you'll see it in the wild.

A recipe card becomes a finished dish through an L notation arrow

Order of operations matters more than you'd think

Just like arithmetic has "multiplication before addition," regular expressions have a precedence order:

  1. Kleene star (highest — applied first)
  2. Concatenation
  3. Union (lowest — applied last)

This means ab* is NOT (ab)*. It's a(b*) — the string a followed by zero or more b's.

Loading diagram...

This is exactly like how 2 + 3 × 4 equals 14 (not 20) because multiplication binds tighter than addition. Parentheses override the default order when you need them.

The empty set is a black hole for concatenation

Here's something that tripped me up at first. When you concatenate anything with the empty set, you get the empty set back.

L(R ∘ ∅) = ∅ — no matter what R is.

Why? Because concatenation needs a string from both operands. If one of them has no strings, you can't build any concatenation at all. It's like trying to make a sandwich when you have bread but no filling — and the recipe requires both.

But the empty string works differently:

L(R ∘ ε) = L(R) — concatenating with the empty string changes nothing.

Loading diagram...

∅ destroys concatenation. ε is invisible to it. ∅ is invisible to union.

A black hole swallows a string while epsilon passes through unchanged

Now let's push it — how far can regular expressions go?

This is the question I find most interesting. We've built up this notation with three operations. Can it describe any language? Or are there patterns too complex for it?

Keep that question in mind. We won't fully answer it until we get to the pumping lemma in a later post. But the hint is: yes, there are languages that no regular expression can describe. And proving that turns out to be one of the most elegant arguments in computer science.

For now, let's shift from describing patterns to checking them.

A DFA is a machine that reads one character at a time and never looks back

Imagine you're a security guard checking IDs at a door. You have a very specific set of rules:

  • You start in a "waiting" state
  • When someone shows their ID, you check one piece of information at a time
  • Based on what you see and what state you're currently in, you transition to a new state
  • At the end, you're either in an "allow entry" state or a "deny entry" state
  • You can't go back and re-check something you already saw

That's a DFA. It's a machine with:

  • A finite set of states (your possible mental states while checking)
  • An alphabet (the characters it can read)
  • A transition function (for every state + character combination, there's exactly one next state)
  • A start state (where computation begins)
  • A set of accept states (the "yes" outcomes)
Loading diagram...

The formal definition (from Sipser's Definition 1.5) writes it as a 5-tuple: (Q, Σ, δ, q₀, F).

The transition function δ is the heart of the machine. It's a function that takes a state and a character and returns exactly one next state. No ambiguity, no choices — that's what makes it deterministic.

A security guard checks character cards one at a time at a campus door

DFAs look like directed graphs with a special vocabulary

When we draw a DFA, it's a directed graph where:

  • Circles are states
  • Arrows are transitions (labeled with characters)
  • The start state gets an incoming arrow from nowhere
  • Accept states get double circles
  • Every state has exactly one outgoing arrow per character in the alphabet

Here's a concrete example. This DFA recognizes strings over a, b where a's are never followed by b's that are then followed by a's again. In other words: all the a's come first, then all the b's.

Loading diagram...

States q0, q1, and q2 are accept states (shown as double circles in actual diagrams). q3 is a "trap state" — once you get there, you can never leave, and the string will be rejected.

Computation is just following arrows

Given an input string, computation works like this:

  1. Start at the start state
  2. Read the first character of the input
  3. Follow the arrow labeled with that character to the next state
  4. Read the next character, follow the corresponding arrow
  5. Repeat until you've consumed the entire string
  6. If you end in an accept state → accept. Otherwise → reject.

Let's trace through the input aabb on the DFA above:

Loading diagram...

Wait — that ends in q3 (reject). But aabb should be "a's followed by b's." Let me re-examine the DFA... Actually, q1 transitions to q3 on b, meaning that once you've seen a's and then encounter a b, you go to the trap state.

That's the wrong DFA for this language. Let me use the correct one from the lecture: a DFA where q1 → q4 on b, and q4 is the accept state for the "seen b after a" phase.

Here's the correct DFA for "a's followed by b's" (at least one a, at least one b):

Loading diagram...

Here q4 is the only accept state. q2 is a "dead" state — once a pattern violation is detected, you end up there permanently.

Tracing aabb:

StepStateReadNext State
Startq1aq3
2q3aq3
3q3bq4
4q4bq4

End state: q4 (accept). The string aabb matches the pattern.

Tracing abab:

StepStateReadNext State
Startq1aq3
2q3bq4
3q4aq2
4q2bq2

End state: q2 (reject). The string abab doesn't match because an a appeared after a b.

Every state in a DFA has a "job"

This is something I picked up that makes DFAs much easier to reason about. Instead of thinking about states as abstract labels, assign each one a role — a description of what being in that state "means" about the input so far.

For the DFA above:

StateRole
q1Haven't seen anything yet
q3Seen one or more a's, no b's yet
q4Seen a's then b's — pattern is valid so far
q2Pattern violated — dead state

This is like how a recipe has stages. You're either "prepping ingredients," "cooking the base," "adding sauce," or "it's burnt and there's no going back." Each stage determines what you should do next. A DFA works the same way — your current state encodes everything you need to know about what you've seen so far.

Loading diagram...

The key insight: the current state is the DFA's only form of memory. It can't remember the exact string it's seen — only which state it ended up in. That's what makes it "finite." And that's also what limits it — but more on that later.

A kitchen workflow maps DFA states to jobs and a ruined pot dead state

The language of a DFA is the set of all strings it accepts

Every DFA defines a language: the set of all strings it accepts. We write this as L(M) where M is the DFA.

Loading diagram...

This connects directly to regular expressions. The language of the DFA above is the same language described by the regular expression a⁺b⁺ — one or more a's followed by one or more b's.

And this raises the big question from the previous post: is there a regular expression for every DFA? And conversely, is there a DFA for every regular expression?

The answer to both is yes. That equivalence is one of the foundational results of this field, and we'll prove it formally in a future post. But for now, just know that these two formalisms — one algebraic (regular expressions), one mechanical (DFAs) — describe exactly the same class of languages: the regular languages.

Different DFAs can recognize different languages — and the proof is just one string

How do you prove two DFAs recognize different languages? You find a single string that one accepts and the other rejects.

That's it. One counterexample is enough to show L(M₁) ≠ L(M₂).

But proving two DFAs recognize the same language is much harder. You'd need to show that every string is handled identically by both machines. One counterexample breaks equality; confirming equality requires exhaustive proof. There are algorithms for this (we'll see them later), but the asymmetry is worth noting.

Complementing a DFA is as easy as flipping accept states

Here's something beautiful. Say you have a DFA that recognizes "strings where ab appears as a substring." What if you want the opposite — strings where ab never appears?

Loading diagram...

Just take the same DFA and swap the accept states with the non-accept states. Every string that was accepted is now rejected, and vice versa. The complement language is recognized by the complemented DFA.

This works because a DFA processes every string to completion — it always lands in some state. So flipping which states count as "accept" perfectly inverts the language.

Think of it like a music audition. The judges have a scorecard. If you flip the pass/fail threshold — what used to be "pass" becomes "fail" and vice versa — you've complemented the selection criteria without changing the audition process itself.

This property is called closure under complementation, and it's one of the reasons regular languages are so well-behaved mathematically.

Two mirrored DFA diagrams show accept states flipped for complement

DFAs in the real world are everywhere — you just don't notice them

The concept of a DFA shows up constantly in systems you interact with daily:

  • Traffic lights cycle through green → yellow → red → green based on timers (input signals). Each color is a state, and the transitions are deterministic.
  • Vending machines track how much money you've inserted. Each amount is a state. Inserting a coin transitions you to the next state. The "accept state" is when you have enough to dispense.
  • Elevator controllers respond to floor requests. The current floor is the state, and button presses are the input symbols that drive transitions.
  • Email validation — when you type an email address and a website highlights it red or green, that's often a DFA checking the pattern character by character.

Traffic lights, vending machines, elevators, and email fields each hide a tiny DFA

Loading diagram...

The connection is that all these systems have finite memory (a fixed number of states), deterministic rules (the same input always produces the same transition), and process inputs sequentially (one event at a time). That's exactly the DFA model.

Where this is heading

We now have two ways to describe regular languages: regular expressions (algebraic descriptions) and DFAs (mechanical checkers). Next up, we'll introduce nondeterministic finite automata (NFAs) — machines that can be in multiple states at once — and show that they're equivalent to DFAs despite being seemingly more powerful.

Then comes the pumping lemma, which gives us a way to prove that certain languages are not regular — that they're genuinely beyond what any DFA or regular expression can handle. That's where the real limits start to appear.

A few things I'm taking away

  • Regular expressions are defined recursively — you build them from atoms (characters, ε, ∅) using three operations (union, concatenation, Kleene star), and the result can be combined again using those same operations
  • The L(R) notation keeps the description (the regular expression) separate from the thing being described (the language) — mixing them up is like confusing a recipe with the dish it makes
  • Order of operations in regular expressions — Kleene star first, then concatenation, then union — can completely change what language is described
  • Concatenating with the empty set destroys everything (like multiplying by zero), but concatenating with the empty string changes nothing (like multiplying by one)
  • A DFA is a 5-tuple: states, alphabet, transition function, start state, and accept states — and the transition function is what makes it deterministic
  • Every state in a DFA has a "job" — assigning roles to states makes it much easier to understand and build DFAs
  • The current state is a DFA's only memory — it can't remember the exact input it's seen, only the state it ended up in
  • Different DFAs can recognize different languages, and a single counterexample string is enough to prove it
  • Complementing a DFA just means flipping which states are accept states — the machine stays the same, only the judgment changes
  • Regular expressions and DFAs describe exactly the same class of languages (regular languages) — this equivalence is a foundational result we'll prove later
  • DFAs show up everywhere in daily life: traffic lights, vending machines, elevators, email validators — any system with finite states and deterministic rules

That second-to-last point is the one that stays with me. Two completely different formalisms — one that looks like algebra, one that looks like a machine with arrows — and they have exactly the same expressive power. Not "roughly the same" or "overlapping." Exactly the same. That kind of perfect equivalence between seemingly unrelated models is what makes this field feel like it's uncovering something fundamental, not just inventing notation.


Comments