Regular Languages Stay Regular No Matter What You Do to Them

One thing I keep finding satisfying about this field is how often a proof boils down to: "just build the machine."

You want to show that some class of languages has a nice property? Don't argue abstractly — construct the machine that does the thing, then show it works. That's what this post is about.

In the previous post, we introduced DFAs — the simplest machines that can check whether a string matches a pattern. Now we're going to do two things: get better at building DFAs from scratch, and then prove that the class of languages they recognize is closed under complementation, union, and intersection.

"Closed" means that if you start with regular languages and apply these operations, you always get another regular language. You can't escape. It's like a kitchen where every dish you make from the ingredients on hand is guaranteed to still be food — you can't accidentally create something inedible by combining valid recipes.

A student arranges symbolic ingredients for closure proofs on a kitchen board

Building a DFA starts with assigning each state a job

The single most useful technique for building DFAs is to think of each state as having a role — a description of what it "means" to be in that state, given what you've seen so far in the input.

Let's build a DFA that recognizes the language "all strings with at least two a's" over the alphabet a, b.

The roles are obvious once you think about it:

State	Role	Accept?
q0	Haven't seen any a's yet	No
q1	Seen exactly one a so far	No
q2	Seen at least two a's	Yes

Loading diagram...

Notice that b-transitions always loop back to the same state. That's because b's don't affect the count of a's — they don't change what we "know" about the input so far. And q2 is a "sink" in the positive sense: once you've seen two a's, you'll always have seen at least two a's, no matter what comes next.

This is like inspecting a property for a real estate deal. You have a checklist. State q0 means "haven't found any of the required features yet." State q1 means "found one, still looking for the second." State q2 means "both requirements met — this property qualifies." Once you've checked both boxes, nothing else on the walk-through can un-check them.

A real estate inspection checklist maps feature checks to DFA states

Flipping the question gives you a new DFA for free

Now, what about "all strings with at most two a's"? The roles change slightly:

State	Role	Accept?
q0	Seen zero a's so far	Yes
q1	Seen exactly one a	Yes
q2	Seen exactly two a's	Yes
q3	Seen three or more a's	No

Loading diagram...

q3 is a trap state — once you get there, you're stuck forever. It's the DFA equivalent of burning the dish. You've seen too many a's, and no amount of additional input can fix that.

The key insight: q3 traps bad inputs permanently. Every DFA character must go somewhere (the transition function is total), so you need a state to absorb inputs that have already violated the pattern. Trap states are that solution.

Tracking a substring requires remembering progress

Here's a more interesting example: build a DFA that recognizes strings containing baba as a substring. This is harder because you need to track how much of the target word you've built so far.

The roles:

State	Role
q0	No progress toward "baba"
q1	Seen "b" — first character matched
q2	Seen "ba" — two characters matched
q3	Seen "bab" — three characters matched
q4	Seen "baba" — complete match (accept)

Loading diagram...

The tricky transitions are the ones where progress is "partially lost." From q2 (seen "ba"), if you see another a, you've seen "baa" — the pattern is broken, so you go back to q0. But from q3 (seen "bab"), if you see another b, you've seen "babb" — which ends in b, so you still have the start of a potential new baba. That's why q3 goes to q1 on b, not q0.

This is like learning a song on piano. You're trying to play a specific four-note sequence. If you hit the wrong note partway through, you don't always restart from scratch — sometimes the wrong note is actually the first note of a new attempt. Your "state" (how far into the sequence you are) needs to account for these overlaps.

Sheet music highlights an overlapping four-note motif as DFA progress states

If you want to see this overlap-recovery idea taken to the extreme — matching many patterns at once with a single automaton — Aho–Corasick is the canonical algorithm. Same trick, just generalized:

Loading video…

Regular languages are defined by what DFAs can recognize

Now for the big definition. A regular language is any language that can be recognized by some DFA.

Loading diagram...

Every DFA we've built so far proves that the corresponding language is regular. The language "at least two a's" is regular because we built a DFA for it. The language "contains baba" is regular because we built a DFA for it.

Later, we'll show that regular expressions describe exactly the same class of languages. But for now, "regular language = recognized by a DFA" is our working definition.

Closure means you can't escape the class

A class of languages is closed under an operation if applying that operation to languages in the class always produces another language in the class.

Loading diagram...

This is a powerful property. It means that regular languages are self-contained under these operations. You can combine them, negate them, intersect them — and you always stay within the same class.

Think of it like a music genre. If you take two jazz songs and blend them together, you still get jazz. If you take a jazz song and play it backwards, it's... still jazz (arguably). The genre is "closed" under these operations. Regular languages work the same way — they're closed under complement, union, intersection, concatenation, and Kleene star.

Jazz and blues records combine into a glowing fusion record

Complement: just flip the accept states

This is the simplest closure proof, and it's beautiful because the construction is so clean.

Theorem: If A is a regular language, then its complement Ā is also regular.

Proof idea: Take the DFA for A. Swap all accept states with reject states. Done.

Loading diagram...

Formally: given DFA M = (Q, Σ, δ, q₀, F), construct M' = (Q, Σ, δ, q₀, Q − F). Everything stays the same except the accept states.

Why it works: Every string either ends in an accept state (and gets accepted by M) or doesn't (and gets rejected by M). Flipping which states are "accept" perfectly inverts the outcome. If M accepted a string, M' rejects it, and vice versa. So L(M') = complement of L(M).

The formal proof is a bidirectional subset argument (showing L(M') ⊆ Ā and Ā ⊆ L(M')), but the intuition is immediate: same machine, opposite verdict.

Remember the "baba" substring DFA from earlier? Its complement — "strings where baba does NOT appear" — uses the exact same states and transitions. You just make q0, q1, q2, q3 the accept states and q4 the reject state. The machine that checks for a pattern is also, with one tweak, the machine that checks for its absence.

Union: run two machines in parallel

This one is more involved. We need to show that if A₁ and A₂ are regular languages, then A₁ ∪ A₂ is also regular.

The idea: Build a new DFA that simulates both original DFAs at the same time. At every step, it tracks where both machines would be if they were each processing the same input independently.

Loading diagram...

The Cartesian product construction

Given:

M₁ = (Q₁, Σ, δ₁, q₁, F₁) recognizing A₁
M₂ = (Q₂, Σ, δ₂, q₂, F₂) recognizing A₂

Construct M = (Q, Σ, δ, q₀, F) where:

Component	Definition
States Q	Q₁ × Q₂ (all ordered pairs)
Start state	(q₁, q₂)
Accept states F	`{(r,s) : r ∈ F₁ OR s ∈ F₂}`
Transition δ((r,s), a)	(δ₁(r,a), δ₂(s,a))

The transition function just runs both machines independently. If you're in state (r, s) and read character a, the first component goes where M₁ would send r on a, and the second component goes where M₂ would send s on a.

The accept condition captures "or" — if either component ends in an accept state of its respective machine, the combined machine accepts.

A concrete example

Say M₁ recognizes "strings starting with a" and M₂ recognizes "the string b" (just the single character b and nothing else).

M₁ has states q0, q1, q2 (q1 is accept — seen a first; q2 is trap — started with b). M₂ has states p0, p1, p2 (p1 is accept — read exactly "b"; p2 is trap).

The cross-product machine has states like (q0,p0), (q1,p2), etc. But not all pairs are reachable from the start state. You only need to draw the ones you can actually reach.

Loading diagram...

Accept states: (q1,p2) because q1 ∈ F₁, and (q2,p1) because p1 ∈ F₂. The machine accepts "strings starting with a" OR "the string b" — which is exactly A₁ ∪ A₂.

Notice that the cross-product could have 3 × 3 = 9 states, but only 4 are reachable. The unreachable ones exist in theory but don't need to be drawn.

This is like having two food critics judge the same restaurant meal simultaneously. Each critic has their own scoring rubric (their own DFA). A dish "passes" if either critic gives it a thumbs up. The Cartesian product tracks both critics' opinions at every course, and the final verdict is "pass if at least one critic approved."

Two judges evaluate the same routine while their scorecards combine into an OR verdict

Intersection comes free from complement and union

Here's a neat trick. We've shown closure under complementation and union. Can we get intersection for free?

Yes — via De Morgan's law:

Loading diagram...

A₁ ∩ A₂ = complement of (complement(A₁) ∪ complement(A₂))

Since we know regular languages are closed under complement and union, we can chain the operations:

Complement A₁ → regular (closure under complement)
Complement A₂ → regular (closure under complement)
Take their union → regular (closure under union)
Complement the result → regular (closure under complement)

Each step preserves regularity, so the final result is regular. Intersection is closed.

A chalkboard Venn diagram shows intersection through De Morgan's law

You could also prove this directly by modifying the Cartesian product construction — just change the accept condition from "either accepts" to "both accept": F = {(r,s) : r ∈ F₁ AND s ∈ F₂}. That gives you intersection instead of union, using the same parallel-simulation idea.

The proof structure is always the same

Every closure proof follows this pattern:

Loading diagram...

A five-step closure proof recipe card is checked off on a corkboard

Step 3 is the creative part — the construction. Step 4 is the rigorous part — the correctness proof. For complement, the construction is trivial (flip accept states) and the proof is a simple bidirectional argument. For union, the construction is the Cartesian product and the proof tracks how the parallel simulation preserves the accept conditions.

This "prove by constructing" approach is one of the hallmarks of computability theory. You don't just show something exists — you show how to build it. And that's deeply satisfying.

What's still missing

We've shown closure under complement, union, and intersection. But what about concatenation and Kleene star?

These are harder. The Cartesian product trick doesn't work for concatenation because you'd need to know where to split the input string between the two machines. A DFA reads left to right and can't go back — so it can't try different split points.

To handle concatenation and Kleene star, we'll need a new tool: nondeterminism. That's the topic of the next post.

A few things I'm taking away

The most effective strategy for building a DFA is assigning each state a role — a plain-language description of what that state "knows" about the input so far
Trap states absorb inputs that have permanently violated the pattern — they're necessary because every state must have a transition for every character
Substring matching in a DFA requires tracking partial progress and handling overlapping patterns carefully
A regular language is defined as any language that some DFA can recognize — this is our formal starting point
Closure under an operation means you can't escape the class by applying that operation — regular languages are remarkably self-contained
Complementing a DFA is trivial: same machine, flip the accept states — the proof is a clean bidirectional subset argument
The Cartesian product construction proves closure under union by running two DFAs in parallel and accepting if either would accept
Intersection comes free from De Morgan's law (or directly, by changing the Cartesian product's accept condition from "or" to "and")
Every closure proof follows the same structure: assume regularity → DFAs exist → construct new DFA → prove correctness
Concatenation and Kleene star can't be handled by the Cartesian product approach — they require nondeterminism, which is next

That last point is what I'm most looking forward to. We've been working entirely in the deterministic world — every state has exactly one transition per character, no ambiguity, no choices. Nondeterminism breaks that constraint, and it turns out to be the key that unlocks the remaining closure properties without making the proofs painfully complex.

Regular Languages Stay Regular No Matter What You Do to Them

Building a DFA starts with assigning each state a job

Flipping the question gives you a new DFA for free

Tracking a substring requires remembering progress

Regular languages are defined by what DFAs can recognize

Closure means you can't escape the class

Complement: just flip the accept states

Union: run two machines in parallel

The Cartesian product construction

A concrete example

Intersection comes free from complement and union

The proof structure is always the same

What's still missing

A few things I'm taking away

Comments