# Database Normalization

## Introduction

**Database Normalization** is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.

The central concept in this discussion is the notion of Functional Dependency (*FDs*), which depends on the semantics of the Data and which deals with what Information in a relation is dependent on what other information in the relation.

We perform normalization on database to avoid redundancy and associated problems and anomalies that may occur. The data redundancies yield the following anomalies

**1**. __Update Anomaly__ If one copy of redundant data is updated, an inconsistency is created unless all copies are similarly updated.**2**. __Insertion Anomaly__ It may not be possible to store some information unless some other key information is stored as well.**3**. __Deletion Anomaly__ It may not be possible to delete some information without loosing some other related information as well.

Normalization decomposes a Relation into two or more smaller relations. Normalization removes the anomalies in the database, but the data that could be retrieved from one relation will require several relations to be joined after normalization.

## Functional Dependency

There are several stages of the normalization process. They are called 1NF, 2NF, 3NF, EKNF, BCNF, 4NF, 5NF, DKNF, 6NF. The higher normal forms is cumulative of the lower normal forms; *for example*, BCNF is always in 3NF which also satisfies to be in 2NF and 1NF.

A function or mapping **ƒ** from set A to a set B, denoted ƒ:A→B, is a correspondence in which to each element *x* of A *(domain)* corresponds exactly one element *y=f(x)* of B *(range)*. We use the notation x-->y, saying *x determines y* or *y is determined by x*. This states that, In a legal relation state (*relational instance*) of R, for any value of *x* there relates to only a single definite value of *y*, this is known as *Single Valued Dependency*. Functional Dependencies define constraints on the set of legal relations.

A functional dependency is **trivial** if it is satisfied by all instances of a relation, this means that A --> B holds, whenever B is a subset of A;*e.g.* FullName-->FirstName, where *FullName=FirstName+LastName*; or simply *X-->X.*

The Right Hand Side is __dependent__ on the Left Hand Side (**determinant**) of the FD. Sets of functional dependencies may have redundant dependencies that can be inferred from the others. Intuitively, a canonical cover of F is a minimal set of FD equivalent to F^{+}, having no redundant dependencies or redundant parts of dependencies. Clearly F^{+} is a superset of F.

## all about Keys and non-keys

**Super Key** is the set of attributes which when taken collectively can uniquely identify a tuple in a relation. This means the super-key has the *uniqueness property*.

A super-key also having the *irreducible property* is a **candidate key**, which implies that any proper-subset of the candidate key no longer remains the candidate key (*minimal superkey*).

One of the candidate key is chosen by the DBA as the principal key called the **Primary key**, the rest if any are the *alternate keys* or **secondary keys**. Primary key enforces the __Entity Integrity rule__ which states that no component of the primary key is allowed to accept NULL values.__Referential Integrity rule__ states that the database must not contain any unmatched foreign key. A **Foreign key** in a base relation R2, is a subset of the set of attributes of R2, say FK, such that:

There exists a base relation R1 not necessarily distinct with a candidate key CK and for every value of FK in the current value of R2 is identical to the value of CK in some tuple in the current value of R1.

Foreign keys may be referenced to any candidate keys either the primary key or unique keys.

The attributes which are a part of a key attribute is called the *prime attribute*, the rest which are not a part of any key are the *non-prime attributes*.

## Determine keys in R with F

We say that an attribute Y is functionally determined by X if *X→Y*. To test whether a set X is a key, we must compute the set of attributes functionally determined by X. We call the set of all attributes functionally determined by X under a set F of **FD** the *closure* of X under F, denoting this by X^{+}. If X^{+} contains all the attributes in R, or X→R, then X is a superkey in R. The steps in finding the keys in R with a given set of FD is :

__Step 1__: Draw a dependency graph of F, each vertices corresponds to an attribute, and dependency by directed edges.__Step 2__: Identify the set of source-vertices V_{n} that have no incoming edges, and the set of sink-vertices V_{o} that have only incoming edges. (*V _{n}≈»indegree=0; V_{o}≈»outdegree=0*)

__Step 3__: Compute V

_{n}

^{+}. A

*candidate key*is a set of attributes that :

**1**. contains all the attributes in V

_{n}(

*attributes present only in the LHS of a FD*), V

_{n}belongs to the set of prime attributes

*.*

**2**. contains no attributes in V

_{o}(

*attributes present only in the RHS of a FD*), V

_{o}belongs to the set of nonprime attributes.

**3**. it is the minimal superkey, confirmed by the closure of remaining attributes in R.

## Finding Candidate Keys

## First Normal Form 1NF

A relation is in 1NF if all underlying domains contain (*indivisible*) __atomic values__ only.

There should not be any duplicate rows in the table

Each cell is single-valued (no repeating groups or arrays)

Entries in a column are of the same kind.

All non-key attributes are dependent on the key attributes

Clearly, if we have defined the key attributes and all the attributes in R are single valued, the relation R satisfies 1NF.

## Second Normal Form 2NF

A relation is in 2NF if it is in 1NF and every non-prime (*non-key*) attribute is fully dependent on the candidate keys of the relation. Clearly, a table is in 2NF if it satisfies 1NF and has no partial dependencies.

In this context, *fully dependent* means that, no proper-subset of the determinant can any longer determine the RHS of a FD; *partial dependencies* mean, any proper-subset of the determinant may determine the RHS of the FD.

A relation R which satisfies 1NF is not in 2NF only if a nonprime attribute is functionally dependent on any proper-subset of the composite key in R (*partial dependency exists*).

## Third Normal Form 3NF

A relation R is in 3NF if R is in 2NF and every non-key attribute of R is *non-transitively* dependent on each candidate key of R. All non-prime attributes in R are required to be directly dependent on every candidate key of the relation.

A relation is in 3NF only if whenever a nontrivial FD A-->B holds in R, either A (*LHS*) is a candidate key or B (*RHS*) is a prime attribute (*member of any candidate key*) of R.

A relation R which is in 2NF is not in 3NF only if there exists any FD, where a non-key attribute functionally determines another non-key attribute in R.

## Boyce Codd Normal Form BCNF

A relation is in BCNF if it is in 3NF and every determinant (*LHS of a FD*) is a candidate key. A relational schema R is in BCNF, if whenever a nontrivial functional dependency holds in R, then the determinant must be a key of R.

BCNF does not allow non-key attribute to functionally determine a prime attribute, which was relaxed in 3NF.

A relation in 3NF is not in BCNF, only if all the following satisfies **1**. there are more than one candidate keys, **2**. the keys are composite and overlapping, and **3**. there exists a FD from a non-overlapping attribute of one key to the non overlapping attribute of another key.

## MultiValued Dependencies and 4NF

We have worked with single valued dependency; while Multivalued Dependency A-->>B defines a relationship in which a definite set of values of attribute B are determined by a single value of A. This means for any value of A instead of having a single value of B we have a definite set of values for B. Functional Dependencies sometimes are referred to as equality-generating dependencies, and Multivalued Dependencies are referred to as tuple-generating dependencies.

More formally, a MVD X-->>Y is called trivial if either *Y is a subset of X *or* X union Y equals R*. It is to be noted that every FD is also an MVD, *since A-->B implies A-->{B,Φ}*.

In 4NF we consider the __non-trivial MVD__; A relation R to be in 4NF, if whenever a MVD holds X-->>Y then either the dependency is trivial or X is a candidate key for R.

A relation which is in BCNF is not in 4NF, only if there exists two or more independent multivalued facts (MVD ≈ *one-to-many *or* many-to-many *relationships) about an entity. So, the relation has three or more attributes in R, based on the non-trivial multivalued dependency X-->>Y (*X multidetermines Y*) we decompose *R* into *R1=(X U Y)* and *R2=(R-Y)*.

## Fifth Normal Form 5NF

5NF sometimes called projection-join normal form (**PJNF**) where every non-trivial join dependency in the table is implied by the superkeys of the table.

Projection is the process of separating one relation into sub-relations and Join is the process of consolidating these sub-relations back into one relation. Sometimes, this join and projection operation produces *spurious tuples* because of join dependencies, *we get more not less*. The key objective of 5NF is to remove the join dependencies.

A Join Dependency JD denoted as JD(R1,R2) is equivalent to the MVD R1∩R2-->>(R1-R2) *symmetrically R1∩R2-->>(R2-R1)*. Conversely, every MVD A-->>B is also a JD *[A U (R-B), A U B].

In 5NF we consider the non-additive join decomposition making sure that Lossless-Join decomposition can be achieved. 5NF does not differ from 4NF unless there exists a symmetric constraint such as a rule about the correspondence between the attributes in R (*cyclic dependencies*).

## Domain/key normal form and 6NF

**Domain/Key normal form** requires every constraint on the table is a *logical consequence* of the table's __domain__ constraints and __key__ constraints. DKNF enforces database constraints by checking that each attribute in a tuple is of the appropriate domain and the key constraints are enforced.**DKNF** defined by Ronald Fagin, theoretically known as the *ultimate normal form*; later C.J. Date with Hugh Darwen, and Nikos Lorentzos defined the **Sixth Normal form** which requires that the table features no non-trivial join dependencies at all (*with reference to generalized join operator*). This implies that a Relvar R of degree *n* is in 6NF, iff it is in 5NF and it has no key of degree less than *n-1*.

## Normalization Theory

## Goals of Normalization

Let *R* be a relation schema with the set* F* of functional dependencies.

- each relation scheme should be in a good form.
- the decomposition is a lossless-join decomposition.
- Preferably, the decomposition is dependency preserving.

A relation R is said to be lossless decomposition into R1 and R2, if only the natural-join of these sub relations produce the original relation without creating additional spurious tuples.

Let, R be decomposed into R1 and R2 (*binary decomposition*),

If either *R1 ∩ R2 → R1* or *R1 ∩ R2 → R2* holds then

the decomposition is loss-less else it is *undesired* lossy.

It is not necessary nor suggested to normalize our database to the highest normal form possible, I consider 3NF sufficient for most database systems. Normalization has the disadvantages that it may require complex join to query the database and make the programmer task harder in extra coding and effort. Although, we have a solution of using *materialized view*, we may want to use non-normalized schema for performance. The main idea of normalization is to decompose the relation into a number of *single-theme tables* removing the database anomalies and problems of data redundancy.

**© 2012 Dipankar Basu**