Module 5

5.1 Distributed Database Concepts
A
distributed computing system
consists of a number of processing elements that are interconnected by a
computer network and that co-operate in performing certain assigned tasks.
A distributed
database (DDB) is a collection of multiple logically interrelated databases
distributed over a computer network. A distributed
database management system (DDBMS)
is a software system that manages a distributed database while making the
distribution transparent to the user. At the physical hardware level, the
following main factors distinguish a DDBMS from a centralized system:
· There are multiple computers called sites or nodes.
· These sites must be connected by some type of communication network to transmit data and commands among sites.
Parallel versus Distributed technology – There are two main types of multiprocessor system architecture:
- Shared memory (tightly coupled) architecture: Multiple processors share both primary memory and secondary (disk) storage.
- Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk) storage, but each has its own primary memory.
Database management systems developed using the above types of architectures are termed parallel database management systems; rather than DDBMSs, they utilize parallel processor technology. In another type of architecture, called shared nothing architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network. Although the shared nothing architecture resembles a distributed database computing environment, major differences exist in the mode of operation. In shared nothing architecture, there is symmetry and homogeneity of nodes; this is not true of the distributed database environment, where heterogeneity of nodes is very common.
Advantages of Distributed Databases:
1. Management of distributed data with
different levels of transparency:
Ideally, a DBMS should be distribution transparent in the sense of
hiding the details of where each file is physically stored within the system.
The following types of transparencies are possible:
· Distribution or network transparency: This refers to freedom for the user from the operational details of the network. It may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of the data and of the site where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.
· Replication transparency: Copies of data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.
· Fragmentation transparency: This makes the user unaware of the existence of fragments.
2. Increased availability and reliability: Reliability is defined as the probability that a system is running at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability.
3. Improved performance: A distributed
DBMS fragments the database by keeping the data closer to where it is needed
most. Data localization reduces the contention for CPU and I/O services and
simultaneously reduces access delays involved in wide area networks. When a large
database is distributed over multiple sites, smaller databases exist at each
site. As a result, local queries and transactions accessing data at a single
site have better performance because of the small local databases. Moreover, interquery and intraquery
parallelism can be achieved by executing multiple queries at different sites.
4. Easier expansion: In a distributed
environment, expansion of the system in terms of adding more data, increasing
database sizes or adding more processors is much easier.
Additional Functions of Distributed Databases
1. Keeping track
of data: The ability to keep track
of the data distribution, fragmentation and replication by expanding the DBMS
catalog.
2. Distributed
Query processing: The ability to access remote sites and transmit queries and
data among the various sites via a communication network.
3. Distributed
transaction management: The ability to devise execution strategies for queries
and transactions that access data from more than one site and to synchronize the
access to distributed data and maintain integrity of the overall database.
4. Replicated
data management: The ability to decide which copy of a replicated data item to
access and to maintain the consistency of copies of a replicated data item.
5. Distributed
database recovery: The ability to recover from individual site crashes and from
new types of failures such as the failure of communication links.
6. Security:
Distributed transactions must be executed with the proper management of the
security of the data and the authorization/access privileges of users.
7. Distributed
directory (catalog) management: A directory contains information (metadata)
about data in the database.
5.2 Data Fragmentation
This
is the process of breaking up the database into logical units called fragments,
which may be assigned for storage at the various sites. There are mainly two
types of fragmentation:
- Horizontal fragmentation
- Vertical fragmentation
a) Horizontal fragmentation – A
horizontal fragment of a relation is a subset of the tuples in that relation.
The tuples that belong to the horizontal fragment are specified by a condition
on one or more attributes of the relation. Often, only a single attribute is
involved. Horizontal fragmentation divides a relation “horizontally” by
grouping rows to create subset of tuples, where each subset has a certain
logical meaning. These fragments can be assigned to different sites in the
distributed system. Derived horizontal
fragmentation applies the partitioning of a primary relation to other
secondary relations which are related to the primary via a foreign key. Each
horizontal fragment on a relation R can be specified by a σCi(R)
operation in the relational algebra. A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R (i.e., every tuple in R satisfies (C1 or C2 or … or Cn)) is called a complete horizontal fragmentation of R. In many cases, a complete horizontal fragmentation is also disjoint; i.e., no tuple in R satisfies (Ci and Cj) for any i ≠ j.
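The idea above can be sketched in code. The following is a minimal Python simulation of σCi(R) over an in-memory EMPLOYEE relation; the tuples and the DNo-based conditions are hypothetical illustration values, not from the text.

```python
# Horizontal fragmentation: each fragment is sigma_Ci(R), the subset of
# tuples of R that satisfy condition Ci. Data values are hypothetical.
employees = [
    {"ssn": "111", "name": "Smith",   "dno": 4},
    {"ssn": "222", "name": "Wong",    "dno": 5},
    {"ssn": "333", "name": "Zelaya",  "dno": 4},
    {"ssn": "444", "name": "Wallace", "dno": 5},
]

frag_site1 = [t for t in employees if t["dno"] == 4]  # sigma_{DNo=4}(EMPLOYEE)
frag_site2 = [t for t in employees if t["dno"] == 5]  # sigma_{DNo=5}(EMPLOYEE)

# Completeness: the union of the fragments contains every tuple of R.
assert sorted(t["ssn"] for t in frag_site1 + frag_site2) == \
       sorted(t["ssn"] for t in employees)
```

Because the conditions DNo = 4 and DNo = 5 together cover every tuple and no tuple satisfies both, this fragmentation is both complete and disjoint.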
b) Vertical fragmentation – Vertical
fragmentation divides a relation “vertically” by columns. A vertical fragment
of a relation keeps only certain attributes of the relation. It is necessary to
include the primary key or some candidate key attribute in every vertical
fragment so that the full relation can be reconstructed from the fragments. For example, consider the schema Employee (Name, Bdate, Address, Sex, SSN, Salary, DNo). We want to fragment this relation into two vertical fragments. The first fragment includes personal information – Name, Address, Bdate and Sex – and the second fragment includes work-related information – SSN, Salary and DNo. This fragmentation is not proper because, if the two fragments are stored separately, we cannot put the original employee tuples back together, since there is no common attribute between the two fragments. Hence we must add the SSN attribute to the personal information fragment also. A vertical fragment on a relation R can be specified by a ΠLi(R) operation in the relational algebra. A set
of vertical fragments whose projection lists L1, L2, ……., Ln include all the
attributes in R but share only the primary key attribute of R is called a
complete vertical fragmentation of R. In this case, the projection lists
satisfy the following conditions:
1. L1 ∪ L2 ∪ … ∪ Ln = ATTRS(R)
2. Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R.
c) Mixed (Hybrid) fragmentation – Mixed fragmentation is the combination of vertical fragmentation and horizontal fragmentation. In general, a fragment of a relation can be constructed by a SELECT-PROJECT combination of operations, ΠL(σC(R)).
- If C = True and L ≠ ATTRS(R), we get a vertical fragment.
- If C ≠ True and L = ATTRS(R), we get a horizontal fragment.
- If C ≠ True and L ≠ ATTRS(R), we get a mixed fragment.
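The third case, ΠL(σC(R)) with both C ≠ True and L ≠ ATTRS(R), can be sketched as follows (relation contents are hypothetical illustration values):

```python
# A mixed fragment pi_L(sigma_C(R)): first select the rows satisfying C,
# then project the result onto the attribute list L.
employees = [
    {"SSN": "111", "Name": "Smith", "Salary": 30000, "DNo": 4},
    {"SSN": "222", "Name": "Wong",  "Salary": 40000, "DNo": 5},
]

def C(t):            # C != True  -> the horizontal (selection) component
    return t["DNo"] == 5

L = ["SSN", "Name"]  # L != ATTRS(R) -> the vertical (projection) component

mixed = [{a: t[a] for a in L} for t in employees if C(t)]
# mixed keeps only Wong's tuple, restricted to SSN and Name.
```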
d) Fragmentation schema – A
fragmentation schema of a database is a definition of a set of fragments that
includes all attributes and tuples in the database and satisfies the condition
that the whole database can be reconstructed from the fragments by applying
some sequence of OUTER UNION and UNION
operations.
e) Allocation schema – An allocation
schema describes the allocation of fragments to sites of the DDBS; hence it is
a mapping that specifies for each fragment the site(s) at which it is stored.
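An allocation schema, being a mapping from fragments to sites, can be represented very simply; the fragment and site names below are hypothetical.

```python
# Allocation schema as a mapping: fragment name -> list of sites storing it.
# A fragment mapped to more than one site is replicated.
allocation = {
    "EMP_D4": ["site1"],                      # stored at exactly one site
    "EMP_D5": ["site2", "site3"],             # replicated at two sites
    "DEPT":   ["site1", "site2", "site3"],    # fully replicated
}

replicated = [f for f, sites in allocation.items() if len(sites) > 1]
# replicated -> ["EMP_D5", "DEPT"]
```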
5.3 Data Replication and Allocation
If a fragment is
stored at more than one site, it is said to be replicated.
a) Fully replicated distributed database – If
the replication of whole database is done at every site in the distributed
system, the resulting database is called a fully replicated distributed
database. This can improve availability remarkably because the system can
continue to operate as long as at least one site is up. It also improves
performance of retrieval for global queries. The disadvantage of full
replication is that it can slow down update operations drastically.
b) Nonredundant allocation – In this
system, each fragment is stored at exactly one site. In this case, all fragments
must be disjoint except for the repetition of primary keys among vertical (or
mixed) fragments.
c) Partial replication – In this
system, some fragments of the database may be replicated whereas others may
not. The number of copies of each fragment can range from one up to the total
number of sites in the distributed system.
d) Replication schema – A description
of the replication of fragments is called replication schema. Each fragment –
or each copy of a fragment – must be assigned to a particular site in the
distributed system. This process is called data
distribution or data allocation.
5.4 Types of Distributed Database Systems
The term distributed database management system can describe various systems that differ from one another in many respects. The main thing that all such systems have in common is the fact that data and software are distributed over multiple sites connected by some form of communication network.
The first factor we consider is the degree of homogeneity of the DDBMS software. If all servers (or individual local DBMSs) use identical software and all users (clients) use identical software, the DDBMS is called homogeneous; otherwise, it is called heterogeneous. Another factor related to the degree of homogeneity is the degree of local autonomy. If there is no provision for the local site to function as a stand-alone DBMS, then the system has no local autonomy. On the other hand, if direct access by local transactions to a server is permitted, the system has some degree of local autonomy.
At one extreme of the autonomy spectrum, we have a DDBMS that "looks like" a centralized DBMS to the user. A single conceptual schema exists, and all access to the system is obtained through a site that is part of the DDBMS, which means that no local autonomy exists. At the other extreme we encounter a type of DDBMS called a federated DDBMS (or a multidatabase system). In such a system, each server is an independent and autonomous centralized DBMS that has its own local users, local transactions, and DBA, and hence has a very high degree of local autonomy. The term federated database system (FDBS) is used when there is some global view or schema of the federation of databases that is shared by the applications. On the other hand, a multidatabase system does not have a global schema and interactively constructs one as needed by the application. Both systems are hybrids between distributed and centralized systems, and the distinction we made between them is not strictly followed. We will refer to them as FDBSs in a generic sense.
In a heterogeneous FDBS, one server may be a relational DBMS, another a network DBMS, and a third an object or hierarchical DBMS; in such a case it is necessary to have a canonical system language and to include language translators to translate subqueries from the canonical language to the language of each server. We briefly discuss the issues affecting the design of FDBSs below.
Federated Database Management Systems Issues
The type of heterogeneity present in FDBSs may arise from several sources.
· Differences in data models: Databases in an organization come from a variety of data models, including the relational data model, the object data model, and so on. The modeling capabilities of the models vary. Hence, to deal with them uniformly via a single global schema or to process them in a single language is challenging. Even if two databases are both from the RDBMS environment, the same information may be represented as an attribute name, as a relation name, or as a value in different databases. This calls for an intelligent query-processing mechanism that can relate information based on metadata.
· Differences in constraints: Constraint facilities for specification and implementation vary from system to system. There are comparable features that must be reconciled in the construction of a global schema. For example, the relationships from ER models are represented as referential integrity constraints in the relational model. Triggers may have to be used to implement certain constraints in the relational model. The global schema must also deal with potential conflicts among constraints.
· Differences in query languages: Even with the same data model, the languages and their versions vary. For example, SQL has multiple versions like SQL-89, SQL-92 (SQL2), and SQL3, and each system has its own set of data types, comparison operators, string manipulation features, and so on.
Semantic Heterogeneity
Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and intended use of the same or related data. Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in designing global schemas of heterogeneous databases. The design autonomy of component DBSs refers to their freedom of choosing the following design parameters, which in turn affect the eventual complexity of the FDBS:
· The universe of discourse from which the data is drawn: For example, two customer accounts databases in the federation may be from the United States and Japan, with entirely different sets of attributes about customer accounts required by the accounting practices. Currency rate fluctuations would also present a problem. Hence, relations in these two databases which have identical names – CUSTOMER or ACCOUNT – may have some common and some entirely distinct information.
· Representation and naming: The representation and naming of data elements and the structure of the data model may be prespecified for each local database.
· The understanding, meaning, and subjective interpretation of data: This is a chief contributor to semantic heterogeneity.
· Transaction and policy constraints: These deal with serializability criteria, compensating transactions, and other transaction policies.
· Derivation of summaries: Aggregation, summarization, and other data-processing features and operations supported by the system.
5.5 Query Processing in Distributed Databases
- Data Transfer Costs of Distributed Query Processing
In a distributed system,
several additional factors further complicate query processing. The first is
the cost of transferring data over the network. This data includes intermediate
files that are transferred to other sites for further processing, as well as
the final result files that may have to be transferred to the site where the
query result is needed. Although these costs may not be very high if the sites
are connected via a high-performance local area network, they become quite
significant in other types of networks. Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.
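A small worked comparison makes the criterion concrete. The relation sizes, record sizes, and site assignments below are hypothetical illustration values: EMPLOYEE resides at site 1, DEPARTMENT at site 2, and their join result is needed at site 3.

```python
# Comparing data-transfer costs (in bytes) of three strategies for a join
# whose result is needed at a third site. All sizes are hypothetical.
emp_size    = 10_000 * 100  # EMPLOYEE: 10,000 tuples x 100 bytes = 1,000,000
dept_size   = 100 * 35      # DEPARTMENT: 100 tuples x 35 bytes   = 3,500
result_size = 10_000 * 40   # one joined row per employee, 40 bytes each

# Strategy A: ship both relations to site 3 and join there.
cost_a = emp_size + dept_size      # 1,003,500 bytes

# Strategy B: ship EMPLOYEE to site 2, join, ship the result to site 3.
cost_b = emp_size + result_size    # 1,400,000 bytes

# Strategy C: ship DEPARTMENT to site 1, join, ship the result to site 3.
cost_c = dept_size + result_size   # 403,500 bytes

best = min(cost_a, cost_b, cost_c)  # Strategy C minimizes data transfer here
```

With these numbers, shipping the small relation to the large one's site (Strategy C) wins; with different sizes a different strategy could be cheapest, which is exactly the choice the optimizer must make.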
- Distributed Query Processing Using Semijoin
The idea behind distributed query processing using the semijoin operation is to reduce the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining column of one relation R to the site where the other relation S is located; this column is then joined with S. Following that, the join attributes, along with the attributes required in the result, are projected out and shipped back to the original site and joined with R. Hence, only the joining column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the tuples in S participate in the join, this can be quite an efficient solution to minimizing data transfer.
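The semijoin steps just described can be sketched with in-memory relations; the relation contents and attribute names below are hypothetical.

```python
# Semijoin sketch: R (e.g. EMPLOYEE, at site A) joins S (e.g. DEPARTMENT,
# at site B) on dno. Only R's joining column travels to site B, and only
# the participating S tuples travel back.
R = [{"ssn": "111", "dno": 4}, {"ssn": "222", "dno": 5}, {"ssn": "333", "dno": 4}]
S = [{"dno": 4, "dname": "Admin"}, {"dno": 7, "dname": "HR"}]

# Step 1: ship only the joining column of R to S's site (a small transfer).
join_col = {t["dno"] for t in R}

# Step 2: at S's site, keep only the S tuples that will participate.
S_reduced = [s for s in S if s["dno"] in join_col]  # no extraneous tuples

# Step 3: ship S_reduced back to R's site and complete the join.
result = [{**r, **s} for r in R for s in S_reduced if r["dno"] == s["dno"]]
```

Here only department 4 participates, so a single S tuple is shipped back instead of the whole relation.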
- Query and Update Decomposition
In a DDBMS with no distribution
transparency, the user phrases a query directly in terms of specific
fragments.
The user must also maintain consistency of replicated data items when
updating a DDBMS with no replication transparency.
On the
other hand, a DDBMS that supports full distribution, fragmentation, and replication
transparency allows the user to specify a query or update request on the
schema just as though the DBMS were centralized. For updates, the DDBMS is
responsible for maintaining consistency among replicated items by using
one of the distributed concurrency control algorithms. For queries, a query decomposition module must break up or decompose a query into subqueries that can be executed at the individual sites. In addition, a
strategy for combining the results of the subqueries to form the query result
must be generated. Whenever the DDBMS determines that an item referenced in the
query is replicated, it must choose or materialize a particular replica during
query execution.
To determine which replicas include the
data items referenced in a query, the DDBMS refers to the fragmentation,
replication, and distribution information stored in the DDBMS catalog. For
vertical fragmentation, the attribute list for each fragment is kept in the catalog.
For horizontal fragmentation, a condition, sometimes called a guard, is
kept for each fragment. This is basically a selection condition that specifies
which tuples exist in the fragment; it is called a guard because only tuples
that satisfy this condition are permitted to be stored in the
fragment. For mixed fragments, both the attribute list and the guard condition are kept in the catalog.
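The catalog's guard mechanism can be sketched as follows; the fragment names, guards, and tuple contents are hypothetical.

```python
# Catalog sketch: each horizontal fragment has a guard, a selection
# condition that a tuple must satisfy to be stored in that fragment.
catalog = {
    "EMP_D4": {"type": "horizontal", "guard": lambda t: t["dno"] == 4},
    "EMP_D5": {"type": "horizontal", "guard": lambda t: t["dno"] == 5},
}

def fragments_for(tup):
    """Return the fragments that may hold this tuple (guards that hold)."""
    return [name for name, meta in catalog.items() if meta["guard"](tup)]
```

During query decomposition, the DDBMS consults such guards to decide which fragments (and hence which sites) a given tuple or subquery can involve.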
5.6 Concurrency Control and Recovery in Distributed
Databases
For
concurrency control and recovery purposes, numerous problems arise in a
distributed DBMS environment that are not encountered in a centralized DBMS
environment. These include the following:
- Dealing with multiple copies of the data items: The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and recovers later.
- Failure of individual sites: The DDBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.
- Failure of communication links: The system must be able to deal with failure of one or more of the communication links that connect the sites. An extreme case of this problem is that network partitioning may occur. This breaks up the sites into two or more partitions, where the sites within each partition can communicate only with one another and not with sites in other partitions.
- Distributed commit: Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol (see Chapter 21) is often used to deal with this problem.
- Distributed deadlock: Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.
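The distributed commit problem above is typically handled with the two-phase commit protocol; the following is a minimal sketch of its decision rule only (no messaging, timeouts, or logging), with hypothetical site names.

```python
# Two-phase commit decision rule: in phase 1 (voting), every participating
# site votes on whether it can commit; in phase 2 (decision), the
# coordinator commits only if ALL votes are "yes", otherwise it aborts.
def two_phase_commit(votes):
    if all(v == "yes" for v in votes.values()):
        return "commit"   # phase 2: tell every site to commit
    return "abort"        # any "no" (or unreachable) site forces abort

assert two_phase_commit({"site1": "yes", "site2": "yes"}) == "commit"
assert two_phase_commit({"site1": "yes", "site2": "no"}) == "abort"
```

A real implementation must also log each phase durably so that a recovering site can learn the outcome, which is what ties the protocol to the recovery issues listed above.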