MODULE 4
4.1 Introduction
Measure Of Quality
We can
discuss the goodness of relation schema in two levels.
1.Logical Level
All you know, logical level represent the
middle level in three level architecture of DBMS. The logical level describes
how users interpret the relation schema and the meaning of their attributes.
Having good relation schema at this level enables users to understand clearly
the meaning of data in relations and hence to formulate the queries correctly.
2 Implementation
Level
It is the
lowermost level in the DBMS architecture, which describes how the tuples in
base relations are stored and updated. This level applies only to the storage
level in the database. But the former logical level applies to both view level
and logical level. And the database becomes effective as much as with the
effective storage.
Database
Design Techniques
Generally we
can design the database in two different approaches.
1.
Top-Down Design (Analysis) methodology
It starts with the major entities of
their interest, their attributes and their relationships. And then we add other
entities and may split these entities into a number of specialized entities and
add the relationships between these entities.
2.
Bottom-Up Design
It starts with a set of attributes.
And group these attributes into entities. Then find out the relationship
between these entities. Identify the higher-level entities, generalized these
entities and locate relationships at this higher level.
Problems with bad schema
1. Redundant
storage of data:
2. Wastage of disc
space
3. More running
time
Informal Design Guidelines for relation schema
There are four
informal measures of quality for relation schema design.
1.
Semantics of the relation attribute
For all attributes belonging to a
relation will have certain real world meaning and a proper interpretation
associated with them. The semantics specifies how to interpret the attribute
values stored in a tuple of the relation.
Guideline:-Design
relation schema so that it is easy to explain its meaning. Do not combine the
attributes from multiple entity types and relationship types into a single
relation. Otherwise if the relation contains multiple entities, the semantics
will be unambiguous and the relation cannot be easily explained.
2.
Reducing the redundant values in tuples and update
anomalies
One goal of schema design is to
minimize the storage space that the space relations occupy. The anomalies that
may be present in the database relation can be classified into three categories.
1.
Insertion anomalies
2.
Deletion anomalies
3.
Modification anomalies
Guideline: - Design
relation schema so that there is no insertion, deletion and modification
anomalies are present in relations. If any anomalies are present note them
clearly and make sure that the programs that update the database will operate
correctly.
3.
Null values in tuples
The null value in tuples is wastage
of storage space in storage level and may also lead to problems with
understanding the meaning of attributes and with specifying join operations at
the logical level. Another problem is how to account the aggregate functions in
null valued attributes.
Guideline: - As far
possible, avoid placing attributes in base relation whose values frequently be
null. If nulls are unavoidable, make sure that they are applicable for
exceptional cases only.
4.3 Constraints
Constraints on database can
generally divided into four main categories.
1.
Inherent model based: Constraints that are inherent to
the data model called inherent model based constraint like, a relation cannot
have duplicate tuples.
2.
Schema Based: Constraints that can be directly
expressed in the schemas of the data model, typically specifying in DDL are
called Schema based constraint.
3.
Application Based:
Schemas expressed by the application programs are called
application-based constraints.
4.
Data Dependency: It is the constraint that is related
to the dependency between the values in the relation.
Now we can go through the details of
schema-based constraints. These schema based constraints are expressed in
relational model. It includes five basic constraints.
1.
Domain constraint
2.
Key constraint
3.
Entity integrity constraint
4.
Referential integrity constraint
5. Constraint on nulls
4.4 Domain constraint
A domain D represents a set of atomic
values. The data type describing the type of values that can appear in each
column is represented by this domain. i.e each value in the domain indivisible
as far as the relational model is concerned. Specifying the data type from
which the data values specifying the domain are drawn specifies the domain. A
domain is given a name, data type and format.
4.5 Entity-Integrity constraint
Entity integrity constraint states that no
primary key value can be null. Primary key value is used to identify tuples.
Having null value in primary key implies we cannot identify tuples. The key
constraint and integrity constraint are specified on individual relations.
4.6 Referential integrity constraint
The Referential integrity constraint is
specified between two relations and is used to maintain the consistency among
tuples in two relations. Referential integrity constraint states that a tuple
in one relation that refer to another relation must refer to an existing tuple
in that relation. To define Referential integrity constraint first we have to
define the concept of foreign key (FK).
A set of attributes FK in relation schema R1,
is a foreign key of R1 that references the R2 relation if it satisfies the
following two rules.
1. The attributes in FK have the same domain
as the PK attributes of R2. The attributes FK are said to reference or refer to
the relation R2.
2. A value of FK in tuple t1 of the current
state r1(R) either occurs as a value of PK for some tuple t2 in the current
state r2(R) or is null.
If
t1[FK]=t2[PK] then we can say
that the tuple t1 refers to the tuple t2.
R1
is referencing relation
R2
is referenced relation.
Key constraint
A relation is defined as a set of tuples. By definition of a relation
all the tuples in a relation are distinct. i.e no two tuples can have the same
value for all their attributes. There are some subsets of relation schema R
with the property that no two tuples in any relation state ‘r ‘ of ‘R’ should have the same value for these
attributes.
A key K of a relation schema R is a super
key of R with the additional property
that removing any attribute A from K
leaves a set of attributes K’ that is not a super key of R any more. Hence key
satisfies the following two constraints.
1. The two distinct tuples in any state of
the relation cannot have identical values for all the attributes in the key.
2. It is a minimal super key. i.e from a
super key we cannot remove any attribute and still have the uniqueness constraint
to hold the first condition.
Null Constraint
It specifies whether the null values are
permitted to an attribute in a database.
4.7 Functional
Dependency (FD)
Functional dependency is a constraint between
two sets of attributes from the database.
Definition: A FD denoted by X→ Y
between two sets of attributes X and Y that are subsets of R specifies a constraint on
the possible tuples that can form a relation state r of R. The
constraint is that, for any two tuples t1 and t2 in r that
have t1[X] =t2[X], they must also have t1[Y] =t2[Y]. i. e. the
values of the Y component of a tuple in r depends on the values
of X component or the X component determines the value of Y component.
i.e
1. The constraint on R states that there
cannot be more than one tuple with a given X value in any relation state r(R) ╡
X→ Y for any subset of attributes Y of R.
2. If X→ Y in R doesn’t say whether or
not Y→ X in R.
A functional dependency FD: is called trivial if Y is a subset of
X.
Definition: A functional
dependency, denoted by X→Y, between two sets of attributes X and Y that are
subsets of the attributes of relation R, specifies that the values in a tuple
corresponding to the attributes in Y are uniquely determined by the values
corresponding to the attributes in X.
For example, the social
security number uniquely determines a name; SSN→ Name
Functional dependencies are
determined by the semantics of the relation, in general, they cannot be
determined by inspection of an instance of the relation. That is, a functional
dependency is a constraint and not a property derived from a relation.
Inference rules
Armstrong's axioms - sound and complete i.e, enable the computation of
any functional dependency.
Functional dependencies are
1. Reflexivity - if the B's are a subset of the A's then A →
B
2. Augmentation - If A → B, then A, C → B, C.
3. Transitivity - If A → B and B → C then A → C.
Additional inference rules
4. Decomposition - If A → B, C then A → B
5. Union - If A → B and A →
C then A → B, C
6. Pseudo transitive - If A → B and C, B → D then C, A → D
Equivalence of sets of functional
dependencies
Two functional dependencies S & T are equivalent iff S→ T and T → S.
The dependency {A_1, ..., A_n} → {B_1, ..., B_m}
is trivial if the
B's are a subset of the A's
is nontrivial if at
least one of the B's is not among the A's
is completely
nontrivial if none of the B's is also one of the A's
Closure (F+)
All dependencies that include F and that can be inferred from F using the
above rules are called closure of F denoted by F+.
Algorithm to compute closure
We have to find out whether F╞ X → Y. This is the case when X → Y Є F+
.For the better method is to generate X+, closure of X under
Fand test F╞ X → Y using the first two axioms augmentation and reflexive rules.
Algorithm:
Input: A set of FD ‘s F and asset of attributes X.
Output: The closure X+ of X under the FD’s F+.
X+:=X;
Change=true;
While change do
Begin
Change: = False;
For each FD W → Z in F do
Begin
If W Í Z then do
Begin
X+:
= X+UZ;
Change:
=True;
End;
End;
End;
4.8 Normalization
In relational database theory, normalization is the process of
restructuring the logical data model of a database to eliminate redundancy,
organize data efficiently, reduce repeating data and to reduce the potential
for anomalies during data operations. Data normalization also may improve data
consistency and simplify future extension of the logical data model. The formal
classifications used for describing a relational database's level of
normalization are called normal forms (NF).
A non-normalized database can suffer from data anomalies:
A non-normalized database may store data representing a particular
referent in multiple locations. An update to such data in some but not all of
those locations results in an update anomaly, yielding inconsistent data. A
normalized database prevents such an anomaly by storing such data (i.e. data
other than primary keys) in only one location.
A non-normalized database may have inappropriate dependencies, i.e.
relationships between data with no functional dependencies. Adding data to such
a database may require first adding the unrelated dependency. A normalized
database prevents such insertion anomalies by ensuring that database relations
mirror functional dependencies.
Similarly, such dependencies in non-normalized databases can hinder deletion.
That is, deleting data from such databases may require deleting data from the
inappropriate dependency. A normalized database prevents such deletion
anomalies by ensuring that all records are uniquely identifiable and contain no
extraneous information.
4.9 Normal forms
Edgar F. Codd originally defined the first three normal
The first normal form requires that tables be made up of a primary key
and a number of atomic fields, and the second and third deal with the
relationship of non-key fields to the primary key. These have been summarized
as requiring that all non-key fields be dependent on "the key, the whole
key and nothing but the key". In practice, most applications in 3NF are
fully normalized. However, research has identified potential update anomalies
in 3NF databases. BCNF is a further refinement of 3NF that attempts to
eliminate such anomalies.
The fourth and fifth normal forms (4NF and 5NF) deal specifically with
the representation of many-many and one-many relationships. Sixth normal form
(6NF) only applies to temporal databases.
4.10. First normal form (1NF)
First normal form (1NF) lays
the groundwork for an organized database design:
Ensure that each table has a
primary key: minimal set of attributes which can uniquely identify a
record. It states that the domain of an
attribute must include only atomic values and the value of any attribute in a
tuple must be single value from the domain of that attribute. It doesn’t allow
nested relation. Data that is redundantly duplicated across multiple rows of a
table is moved out to a separate table.
Atomicity: Each attribute must contain a single value, not a set of
values.
Eg: Consider a Relation Person. The person will have the attributes SSN,
Name, Age, Address and College_Degree.
Person
Table
4.1
Now we can analyze this relation. Now check what are the possible values
of each attributes. Here SSN and Age will have only one value for a person. But
The college_Degree will have more than one value. And the address and Name of
person can be divided into more than one attributes. Hence this relation is not
in 1NF. So let us change this relation schema into 1NF by dividing this
relation into two relations.
Name→ FName, MInit, LName
Address→ ApartmentNo, City
Person_Residence
Table
4.2
College_Degree
Table
4.3
4.11Second normal form
(2NF)
First, the table must be in 1NF,
plus, we want to make sure that every Non-Primary-Key attribute (field) is
fully functionally dependent upon the ENTIRE Primary-Key for its
existence. This rule ONLY applies when you have a multi-part
(concatenated) Primary Key (PK).
It requires that data stored in a table with a
composite primary key must not be dependent on only part of the table's primary
key. And the database must meet all the requirements of the first normal form.
Take each non-key field, and ask this question:
If I knew part of the PK, could I tell what the non-key field would be.
Inventory
Table4.4
In this inventory table, Description
combined with Supplier is our PK. This is because we have two of the same
product that come from different suppliers.
There
are two non-key fields. So, we can ask the questions:
If we know just Description, can we find
out Cost? No, because we have more than one supplier for the same
product.
If we know just Supplier, and we find out Cost? No, because we need to know what the Item is as well.
If we know just Supplier, and we find out Cost? No, because we need to know what the Item is as well.
Therefore,
Cost is fully, functionally dependent upon the ENTIRE PK (Description-Supplier) for its
existence.
If we know just Description, can we
find out Supplier Address? No, because we have more than one supplier for
the same product.
If we know just Supplier, Can we find out Supplier Address? Yes. The Address does not depend upon the Description of the item.
If we know just Supplier, Can we find out Supplier Address? Yes. The Address does not depend upon the Description of the item.
Therefore,
Supplier Address is NOT functionally dependent upon the ENTIRE PK (Description-Supplier) for its
existence.
We
must get rid of Supplier Address from this table.
Inventory
Table
4.5
Supplier
Table 4.6
At this point, since it is the
"Supplier" table, we can rename the "Supplier" filed to
"Name." Name is the PK for this new table.
General Definition:
A relation schema R is
in second normal form (2NF) if every nonprime attribute A in R is not partially
dependent on any key of R.
4.12 Third normal form (3NF)
For 3NF, first, the table must be in
2NF, plus, we want to make sure that the non-key fields are dependent upon ONLY
the PK, and not on any other field in the table. This is very similar to
2NF, except that now you are comparing the non-key fields to OTHER non-key
fields.
For database to be in third normal form
1. The database must meet all the requirements of the second normal form.
2. Any field which is dependent not only on the primary key but also on
another field is moved out to a separate table.
Book
Table 4.7
Again, just ask
the questions:
If I know # of
Pages, can I find out Author's Name? No. Can I find out Author's
affiliation No? No.
If I know
Author's Name, can I find out # of Pages? No. Can I find out
Author's affiliation No? YES.
Therefore,
Author's affiliation No is functionally dependent upon Author's Name, not the
PK for its existence.
|
Book
Table 4.8
Author
Table 4.9
General Definition:
A relation schema R is
in 3NF if, whenever a nontrivial functional dependency X→A holds in R,
Either a) X is a Superkey
Or b) Y is a prime attribute of R.
i.e. A relation schema R is in 3NF if every nonprime attribute of R meets
both of the following terms:
1. It is fully
functionally dependent on every key of R.
2. It is nontransitively
dependent on every key of R.
4.13 Boyce-Codd normal form (BCNF)
A row is in BCNF if and only if
every determinant is a candidate key.
The second and third
normal forms assume that all attributes not part of the candidate keys depend
on the candidate keys but does not deal with dependencies within the keys. BCNF deals with such
dependencies.
A relation R is said
to be in BCNF if whenever X -> A holds in R, and A is not in X, then
X is a candidate key for R.
BCNF covers very
specific situations where 3NF misses interdependencies between non key
attributes. It should be noted that most relations that are in 3NF are also in
BCNF. Infrequently, a 3NF relation is not in BCNF and this happens only if
(a) the candidate keys in the relation
are composite keys (that is, they are not single attributes),
(b) there is more than one candidate key in the relation, and
(c) the keys are not disjoint, that is, some attributes in the keys are common.
(b) there is more than one candidate key in the relation, and
(c) the keys are not disjoint, that is, some attributes in the keys are common.
The BCNF differs from the 3NF only when
there are more than one candidate keys and the keys are composite and
overlapping. Consider for example, the relationship
enrol (sno, sname, cno, cname,
date-enrolled)
Let us assume that the relation has the
following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
(sno, cname)
(sname, cno)
(sname, cname)
(we have assumed sname and cname
are unique identifiers). The relation is in 3NF but not in BCNF because there
are dependencies
sno -> sname
cno -> cname
cno -> cname
where
attributes that are part of a candidate key are dependent on part of another
candidate key. Such dependencies indicate that although the relation is about
some entity or association that is identified by the candidate keys
e.g.
(sno, cno), there are attributes that are not about the whole thing
that the keys identify. For example, the above relation is about an association
(enrolment) between students and subjects and therefore the relation needs to
include only one identifier to identify students and one identifier to identify
subjects. Providing two identifiers about students (sno, sname) and
two keys about subjects (cno, cname) means that some information about
students and subjects that is not needed is being provided. This provision of
information will result in repetition of information and the anomalies. If we
wish to include further information about students and courses in the database,
it should not be done by putting the information in the present relation but by
creating new relations that represent information about entities student
and subject.
These difficulties may be overcome by
decomposing the above relation in the following three relations:
(sno, sname)
(cno, cname)
(sno, cno, date-of-enrolment)
(cno, cname)
(sno, cno, date-of-enrolment)
We now have a relation that only has
information about students, another only about subjects and the third only
about enrolments. All the anomalies and repetition of information have been removed.
4.14 Multivalued Dependency and
Fourth normal form
In a relational
model, if all of the information about an entity is to be represented in one
relation, it will be necessary to repeat all the information other than the
multivalue attribute value to represent all the information that we wish to
represent. This results in many tuples about the same instance of the entity in
the relation and the relation having a composite key (the entity id and the
mutlivalued attribute). Of course the other option suggested was to represent
this multivalue information in a separate relation. The situation of course
becomes much worse if an entity has more than one multivalued attributes and
these values are represented in one relation by a number of tuples for each
entity instance. The multivalued dependency relates to this problem when more
than one multivalued attributes exist. Consider the following relation that
represents an entity employee that has one mutlivalued attribute proj:
emp (e#, dept,
salary, proj)
We have so far
considered normalization based on functional dependencies; dependencies that
apply only to single-valued facts. For example, e# -> dept implies
only one dept value for each value of e#. Not all information
in a database is single-valued, for example, proj in an employee
relation may be the list of all projects that the employee is currently working
on. Although e# determines the list of all projects that an employee
is working on, e# -> proj is not a functional dependency.
We can more clear the multivalued
dependency by the following example.
programmer
(emp_name, qualifications, languages)
This relation
includes two multivalued attributes of entity programmer; qualifications
and languages. There are no functional dependencies.
The attributes
qualifications and languages are assumed independent of each other. If we were
to consider qualifications and languages separate entities,
we would have two relationships (one between employees and qualifications
and the other between employees and programming languages).
Both the above relationships are many-to-many i.e. one programmer could have
several qualifications and may know several programming languages. Also one
qualification may be obtained by several programmers and one programming
language may be known to many programmers.
Functional
dependency A -> B relates one value of A to one value of B
while multivalued dependency A ->> B defines a relationship in
which a set of values of attribute B are determined by a single value
of A.
Now, more formally, X
->> Y is said to hold for R(X, Y, Z) if t1 and t2
are two tuples in R that have the same values for attributes X
and therefore with t1[x] = t2[x] then R also contains tuples t3
and t4 (not necessarily distinct) such that
t1[x] = t2[x] = t3[x] = t4[x]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]
In other words if t1
and t2 are given by
t1 = [X, Y1, Z1], and
t2 = [X, Y2, Z2]
t2 = [X, Y2, Z2]
then there must be
tuples t3 and t4 such that
t3 = [X, Y1, Z2], and
t4 = [X, Y2, Z1]
t4 = [X, Y2, Z1]
We are therefore
insisting that every value of Y appears with every value of Z
to keep the relation instances consistent. In other words, the above conditions
insist that X alone determines Y and Z and there is no relationship
between Y and Z since Y and Z appear in
every possible pair and hence these pairings present no information and are of
no significance.
Fourth normal form
Fourth normal form (or 4NF) requires that there be no non-trivial
multivalued dependencies of attribute sets on something other than a superset
of a candidate key. A table is said to be in 4NF if and only if it is in the
BCNF and multivalued dependencies are functional dependencies. The 4NF removes
unwanted data structures: multivalued dependencies.
Definition: A relation schema R is in 4NF with respect to a set of
dependencies if, for every non trivial multivalued dependency X ->>Y in F+,
X is a superkey for R.
Properties Of Relational
Decompositions
Decomposition Property: A relation must satisfy the following two
properties during decomposition.
i. Lossless:- A lossless-join
dependency is a property of decomposition, which ensures that spurious rows are
generated when relations are united through a natural join operation. i.e. The
information in an instance r of R must be preserved in the instances r1,
r2, r3, …..rk
where ri =
∏Ri (r)
Decomposition is lossless with respect to a set of functional
dependencies F if, for every relation instance r on R satisfying F,
R=∏R1 (r) * ∏R2 (r) * . . . . . . . ∏Rn (r)
ii. Dependency Preserving
Property: - If a set of functional dependencies hold on R it should be
possible to enforce F by enforcing appropriate dependencies on each r1
.
Decomposition D= (R1, R2,
R3, ………, Rk) of schema R preserves a set of dependencies
F if,
(∏R1 (F) U ∏R2 (F) U . . . . . . . . . . . ∏Rn
(F)) +=F+
∏Ri(F) is the projection of F onto Ri.
i.e Any FD that logically follows from F must also logically follows from
the union of projection of F onto Ri ‘S . Then D is called
dependency preserving.
4.15 Join Dependency and Fifth Normal
Form
Join dependency is the
term used to indicate the property of a relation schema that cannot be
decomposed losslesly into two relation schema, but can be decomposed losslesly
into three or more simpler relation schema. It means that a table, after it has
been decomposed into three or more smaller tables must be capable of being
joined again on common keys to form the original table.
Fifth normal form
Fifth normal form (5NF and also PJ/NF) requires that there are no
non-trivial join dependencies that do not follow from the key constraints. A
table is said to be in the 5NF if and only if it is in 4NF and the candidate
keys imply every join dependency in it.
4.16 Pitfalls in Relational Database Design.
A bad
design may have several properties, including:
- Repetition of information.
- Inability to represent certain
information.
- Loss of information.
0 comments:
Post a Comment