Database Normalization (1NF, 2NF, 3NF)
What You'll Learn
By the end of this page, you will be able to:
- Explain why normalization matters — what problems it solves
- Define 1NF, 2NF, and 3NF with their exact requirements
- Transform an unnormalized table step-by-step through each normal form
- Identify functional dependencies and transitive dependencies
- Know when to denormalize for performance
The Problem: Redundant & Inconsistent Data
Imagine a single table storing everything about orders:
| order_id | customer_name | customer_email | product_names | quantities | prices |
|---|---|---|---|---|---|
| 1 | Alice | alice@mail.com | Laptop, Mouse | 1, 2 | 999, 25 |
| 2 | Alice | alice@mail.com | Keyboard | 1 | 79 |
Problems:
- Update anomaly — Alice's email is duplicated. Updating it in one row but not the other causes inconsistency.
- Insert anomaly — You cannot add a new customer who hasn't placed an order yet.
- Delete anomaly — Deleting order 2 deletes Alice's only record, losing her contact info entirely.
Normalization is the process of structuring a database to eliminate these anomalies by splitting tables and defining relationships.
Key Concepts
Functional Dependency
A functional dependency exists when the value of one attribute determines the value of another.
- Notation:
A → Bmeans "A determines B" - Example:
student_id → student_name(knowing the student ID uniquely determines the name)
Partial Dependency
A partial dependency occurs when a non-key attribute depends on only part of a composite primary key, not the whole key.
Transitive Dependency
A transitive dependency occurs when A → B and B → C, meaning A → C indirectly. The non-key attribute C depends on another non-key attribute B rather than directly on the primary key.
First Normal Form (1NF)
Rules
- Atomic values — each cell contains a single, indivisible value (no lists, arrays, or comma-separated values)
- No repeating groups — each column has a unique name; no duplicate columns like
phone1,phone2 - Unique rows — each row is distinguishable (typically via a primary key)
Before 1NF (violations)
| order_id | customer_name | products |
|---|---|---|
| 1 | Alice | Laptop, Mouse, Keyboard |
| 2 | Bob | Monitor |
Violations: products contains multiple values in a single cell.
After 1NF
| order_id | customer_name | product |
|---|---|---|
| 1 | Alice | Laptop |
| 1 | Alice | Mouse |
| 1 | Alice | Keyboard |
| 2 | Bob | Monitor |
Now every cell is atomic. But we introduced a new problem: customer_name repeats for every product in the same order — that's redundancy, addressed by 2NF.
Second Normal Form (2NF)
Rules
- Must be in 1NF
- No partial dependencies — every non-key attribute must depend on the entire primary key, not just part of it
2NF only applies to tables with composite primary keys. If your table has a single-column primary key, it is automatically in 2NF (as long as it's in 1NF).
Before 2NF (violations)
Consider a table tracking which student takes which course, with the instructor for each course:
| student_id | course_id | student_name | course_name | instructor |
|---|---|---|---|---|
| 1 | CS101 | Alice | Intro to CS | Prof. Smith |
| 1 | MATH200 | Alice | Calculus II | Prof. Jones |
| 2 | CS101 | Bob | Intro to CS | Prof. Smith |
Composite PK: (student_id, course_id)
Dependencies:
(student_id, course_id) → instructor✅ full dependencystudent_id → student_name❌ partial dependency — depends on only part of the keycourse_id → course_name, instructor❌ partial dependency — depends on only part of the key
After 2NF — Split into three tables
enrollments (junction table):
| student_id | course_id |
|---|---|
| 1 | CS101 |
| 1 | MATH200 |
| 2 | CS101 |
students:
| student_id | student_name |
|---|---|
| 1 | Alice |
| 2 | Bob |
courses:
| course_id | course_name | instructor |
|---|---|---|
| CS101 | Intro to CS | Prof. Smith |
| MATH200 | Calculus II | Prof. Jones |
Now every non-key attribute depends on the whole primary key of its table.
Third Normal Form (3NF)
Rules
- Must be in 2NF
- No transitive dependencies — non-key attributes must depend on the primary key directly, not through another non-key attribute
Before 3NF (violations)
| emp_id | emp_name | dept_id | dept_name | dept_location |
|---|---|---|---|---|
| 1 | Alice | D01 | Engineering | Floor 3 |
| 2 | Bob | D01 | Engineering | Floor 3 |
| 3 | Carol | D02 | Marketing | Floor 1 |
Dependencies:
emp_id → emp_name, dept_id, dept_name, dept_location✅dept_id → dept_name, dept_location❌ transitive dependency- So:
emp_id → dept_id → dept_name, dept_location—dept_nameanddept_locationdepend ondept_id, not directly onemp_id
This causes: if Engineering moves to Floor 5, you must update every employee in that department — update anomaly.
After 3NF — Split into two tables
employees:
| emp_id | emp_name | dept_id |
|---|---|---|
| 1 | Alice | D01 |
| 2 | Bob | D01 |
| 3 | Carol | D02 |
departments:
| dept_id | dept_name | dept_location |
|---|---|---|
| D01 | Engineering | Floor 3 |
| D02 | Marketing | Floor 1 |
Now dept_name and dept_location depend only on dept_id (the primary key of departments), not transitively through emp_id.
Summary: 1NF → 2NF → 3NF
| Normal Form | Requirement | Eliminates |
|---|---|---|
| 1NF | Atomic values, no repeating groups | Multi-valued cells, duplicate columns |
| 2NF | 1NF + no partial dependencies | Redundancy from composite key subsets |
| 3NF | 2NF + no transitive dependencies | Redundancy from non-key → non-key dependencies |
Quick check for each form:
- 1NF: Is every cell a single value? Are there no repeating groups?
- 2NF: Does every non-key column depend on the whole primary key? (Only relevant for composite keys)
- 3NF: Does every non-key column depend on the primary key directly, not through another non-key column?
Full Example: Step-by-Step Transformation
Unnormalized table
| order_id | date | customer_id | customer_name | customer_city | product_id | product_name | quantity | unit_price |
|---|---|---|---|---|---|---|---|---|
| 1001 | 2024-01-15 | C50 | Alice | Hanoi | P01 | Laptop | 1 | 999 |
| 1001 | 2024-01-15 | C50 | Alice | Hanoi | P02 | Mouse | 2 | 25 |
| 1002 | 2024-01-16 | C51 | Bob | Saigon | P01 | Laptop | 1 | 999 |
After 1NF — already atomic ✅
Every cell is a single value. No repeating groups.
After 2NF — check partial dependencies
PK: (order_id, product_id) — a composite key.
order_id → date, customer_id, customer_name, customer_city❌ partial — split outordersproduct_id → product_name, unit_price❌ partial — split outproducts(order_id, product_id) → quantity✅ full — stays inorder_items
Result: orders, order_items, products
After 3NF — check transitive dependencies
In the orders table: customer_id → customer_name, customer_city — customer_name and customer_city depend on customer_id, not directly on order_id. This is a transitive dependency.
Result: split orders into orders and customers.
Final normalized schema (3NF)
customers:
| customer_id | customer_name | customer_city |
|---|---|---|
| C50 | Alice | Hanoi |
| C51 | Bob | Saigon |
orders:
| order_id | order_date | customer_id |
|---|---|---|
| 1001 | 2024-01-15 | C50 |
| 1002 | 2024-01-16 | C51 |
products:
| product_id | product_name | unit_price |
|---|---|---|
| P01 | Laptop | 999 |
| P02 | Mouse | 25 |
order_items:
| order_id | product_id | quantity |
|---|---|---|
| 1001 | P01 | 1 |
| 1001 | P02 | 2 |
| 1002 | P01 | 1 |
When to Denormalize
Normalized schemas minimize redundancy but can increase JOIN complexity and query latency. In practice, denormalization is common for:
| Scenario | What to do |
|---|---|
| Read-heavy workloads | Duplicate frequently accessed columns to avoid JOINs |
| Reporting / OLAP | Pre-aggregate data into summary tables or materialized views |
| High-latency JOINs | Embed related data (e.g., store customer_name in orders for fast display) |
| Distributed databases | Co-locate related data to minimize cross-node queries |
Normalize first, denormalize deliberately. Start with 3NF as your baseline. Only denormalize when you have a measurable performance problem and a clear benefit.
Common Pitfalls
- Stopping at 1NF — multi-valued cells "work" initially but cause parsing nightmares and prevent indexing
- Over-normalizing — splitting into too many tiny tables makes queries unreadably complex with excessive JOINs
- Forgetting to index foreign keys — normalized schemas rely heavily on JOINs; missing indexes on foreign keys kills performance
- Denormalizing too early — premature denormalization introduces the very anomalies normalization was designed to prevent
- Ignoring business rules — normalization is a structural tool; complex business rules may still require application-level validation
Interview Questions
1. What is database normalization and why is it important?
Normalization is the process of organizing a database into tables and columns to minimize redundancy and eliminate data anomalies (insert, update, delete). It ensures each piece of data is stored in exactly one place, making the database easier to maintain and more consistent.
2. Explain 1NF, 2NF, and 3NF in simple terms.
| Normal Form | In Simple Terms |
|---|---|
| 1NF | One value per cell. No lists or comma-separated values in a column. |
| 2NF | Every non-key column depends on the whole primary key. (No partial dependencies.) |
| 3NF | Every non-key column depends only on the primary key. (No transitive dependencies.) |
A common mnemonic: "Every non-key attribute must provide a fact about the key (1NF), the whole key (2NF), and nothing but the key (3NF)."
3. What is the difference between a partial dependency and a transitive dependency?
- Partial dependency: a non-key attribute depends on part of a composite key (e.g.,
student_namedepends only onstudent_id, not on(student_id, course_id)). Relevant only for composite primary keys. - Transitive dependency: a non-key attribute depends on another non-key attribute (e.g.,
dept_namedepends ondept_id, which depends onemp_id). Can exist even with single-column primary keys.
4. Is a table with a single-column primary key automatically in 2NF?
Yes. 2NF requires eliminating partial dependencies — where a non-key column depends on only part of the primary key. If the primary key is a single column, there is no "part of it" to partially depend on, so no partial dependencies can exist.
5. When would you denormalize a database?
Denormalize when query performance is more important than storage efficiency:
- Read-heavy systems where JOINs are a bottleneck
- Reporting and analytics dashboards that need pre-aggregated data
- CQRS architectures where read models are separate from write models
- Distributed databases where cross-node JOINs are expensive
Always normalize first, then denormalize with a specific performance goal in mind.
6. What anomalies does normalization prevent?
| Anomaly | Description | Example |
|---|---|---|
| Insert anomaly | Cannot insert a record without unrelated data | Cannot add a new department until an employee is assigned to it |
| Update anomaly | Updating one value requires updating multiple rows | Changing a customer's email requires updating every order row |
| Delete anomaly | Deleting one record unintentionally removes other data | Deleting the last order for a customer removes the customer entirely |
Learn More
- Primary Key vs Foreign Key — how keys enforce relationships between normalized tables
- ACID Properties — how transactions guarantee consistency in normalized databases
- SQL Joins — querying across normalized tables
- Database Indexing — speeding up JOINs between normalized tables