
Database Management Systems (Module 5)



Module 5

5.1 Distributed Database Concepts
 A distributed computing system consists of a number of processing elements that are interconnected by a computer network and that co-operate in performing certain assigned tasks.
  A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is a software system that manages a distributed database while making the distribution transparent to the user. At the physical hardware level, the following main factors distinguish a DDBMS from a centralized system:
·         There are multiple computers called sites or nodes.
·         These sites must be connected by some type of communication network to transmit data and commands among sites.

Parallel versus Distributed technology – There are two main types of multiprocessor system architecture:
  • Shared memory (tightly coupled) architecture: Multiple processors share secondary (disk) storage and also share primary memory.
  • Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk) storage, but each has its own primary memory.
Database management systems developed using the above types of architectures are termed parallel database management systems rather than DDBMSs, since they utilize parallel processor technology. In another type of architecture, called shared nothing architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network. Although the shared nothing architecture resembles a distributed database computing environment, major differences exist in the mode of operation. In shared nothing architecture, there is symmetry and homogeneity of nodes; this is not true of the distributed database environment, where heterogeneity of nodes is very common.

Advantages of Distributed Databases
1. Management of distributed data with different levels of transparency:  Ideally, a DBMS should be distribution transparent in the sense of hiding the details of where each file is physically stored within the system. The following types of transparencies are possible:
·         Distribution or network transparency: This refers to the freedom for the user from the operational details of the network. It may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.
·         Replication transparency: Copies of data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of these copies.
·         Fragmentation transparency: Fragmentation makes the user unaware of the existence of fragments.
2. Increased availability and reliability: Reliability is defined as the probability that a system is running at a certain time point, while availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability.
3. Improved performance: A distributed DBMS fragments the database by keeping the data closer to where it is needed most. Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks. When a large database is distributed over multiple sites, smaller databases exist at each site. As a result, local queries and transactions accessing data at a single site have better performance because of the small local databases.  Moreover, interquery and intraquery parallelism can be achieved by executing multiple queries at different sites.
4. Easier expansion: In a distributed environment, expansion of the system in terms of adding more data, increasing database sizes or adding more processors is much easier.

Additional Functions of Distributed Databases
1. Keeping track of data: The ability to keep track of the data distribution, fragmentation and replication by expanding the DBMS catalog.
2. Distributed Query processing: The ability to access remote sites and transmit queries and data among the various sites via a communication network.
3. Distributed transaction management: The ability to devise execution strategies for queries and transactions that access data from more than one site and to synchronize the access to distributed data and maintain integrity of the overall database.
4. Replicated data management: The ability to decide which copy of a replicated data item to access and to maintain the consistency of copies of a replicated data item.
5. Distributed database recovery: The ability to recover from individual site crashes and from new types of failures such as the failure of communication links.
6. Security: Distributed transactions must be executed with the proper management of the security of the data and the authorization/access privileges of users.
7. Distributed directory (catalog) management: A directory contains information (metadata) about data in the database.
5.2 Data Fragmentation
 This is the process of breaking up the database into logical units called fragments, which may be assigned for storage at the various sites. There are mainly two types of fragmentation:
        • Horizontal fragmentation
        • Vertical fragmentation
a) Horizontal fragmentation – A horizontal fragment of a relation is a subset of the tuples in that relation. The tuples that belong to the horizontal fragment are specified by a condition on one or more attributes of the relation; often, only a single attribute is involved. Horizontal fragmentation divides a relation “horizontally” by grouping rows to create subsets of tuples, where each subset has a certain logical meaning. These fragments can be assigned to different sites in the distributed system. Derived horizontal fragmentation applies the partitioning of a primary relation to other secondary relations which are related to the primary via a foreign key. Each horizontal fragment of a relation R can be specified by a σCi(R) operation in the relational algebra. A set of horizontal fragments whose conditions C1, C2, ..., Cn include all the tuples in R (i.e., every tuple in R satisfies (C1 or C2 or ... or Cn)) is called a complete horizontal fragmentation of R. In many cases, a complete horizontal fragmentation is also disjoint; i.e., no tuple in R satisfies (Ci and Cj) for any i ≠ j.
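The completeness and disjointness properties can be checked mechanically. The following is a minimal sketch, modeling a relation as a list of dicts; the toy data and the conditions C1, C2 on DNo are invented for illustration and are not from the text:

```python
# Minimal sketch: horizontal fragments as sigma_Ci(R), with the relation R
# modeled as a list of dicts and each condition Ci as a Python predicate.
# The data and conditions below are illustrative assumptions.

R = [
    {"SSN": "1", "Name": "A", "DNo": 4},
    {"SSN": "2", "Name": "B", "DNo": 5},
    {"SSN": "3", "Name": "C", "DNo": 5},
]

# C1: DNo = 4, C2: DNo = 5
conditions = [lambda t: t["DNo"] == 4, lambda t: t["DNo"] == 5]

# Each horizontal fragment is sigma_Ci(R): the tuples satisfying Ci.
fragments = [[t for t in R if c(t)] for c in conditions]

# Complete: every tuple satisfies (C1 or C2 or ... or Cn).
complete = all(any(c(t) for c in conditions) for t in R)

# Disjoint: no tuple satisfies (Ci and Cj) for any i != j.
disjoint = all(sum(c(t) for c in conditions) <= 1 for t in R)

print(complete, disjoint)  # True True
```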

b) Vertical fragmentation – Vertical fragmentation divides a relation “vertically” by columns. A vertical fragment of a relation keeps only certain attributes of the relation. It is necessary to include the primary key or some candidate key attribute in every vertical fragment so that the full relation can be reconstructed from the fragments. For example, consider the schema Employee (Name, Bdate, Address, Sex, SSN, Salary, DNo). We want to fragment this relation into two vertical fragments. The first fragment includes personal information – Name, Address, Bdate, and Sex – and the second fragment includes work-related information – SSN, Salary, and DNo. This fragmentation is not proper because, if the two fragments are stored separately, we cannot put the original employee tuples back together, since there is no common attribute between the two fragments. Hence we must add the SSN attribute to the personal information fragment as well. A vertical fragment of a relation R can be specified by a ΠLi(R) operation in the relational algebra. A set of vertical fragments whose projection lists L1, L2, ..., Ln include all the attributes in R but share only the primary key attribute of R is called a complete vertical fragmentation of R. In this case, the projection lists satisfy the following conditions:
1. L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
2. Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R.
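The Employee example above can be sketched as follows. The two sample tuples are invented for illustration; the projection lists and the reconstruction by natural join on SSN follow the text:

```python
# Sketch of the Employee vertical fragmentation described above, with
# invented sample tuples. Pi_L is modeled as dict projection.

employee = [
    {"Name": "Smith", "Bdate": "1965-01-09", "Address": "Houston",
     "Sex": "M", "SSN": "123", "Salary": 30000, "DNo": 5},
    {"Name": "Wong", "Bdate": "1955-12-08", "Address": "Houston",
     "Sex": "M", "SSN": "333", "Salary": 40000, "DNo": 5},
]

L1 = ["SSN", "Name", "Bdate", "Address", "Sex"]  # personal info + key
L2 = ["SSN", "Salary", "DNo"]                    # work-related info + key

def project(rel, attrs):
    """Pi_L(R): keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in rel]

personal, work = project(employee, L1), project(employee, L2)

# Condition 1: L1 union L2 = ATTRS(R); condition 2: L1 intersect L2 = PK(R).
assert set(L1) | set(L2) == set(employee[0])
assert set(L1) & set(L2) == {"SSN"}

# Reconstruction: natural join of the two fragments on the shared key SSN.
rebuilt = [{**p, **w} for p in personal for w in work if p["SSN"] == w["SSN"]]
print(rebuilt == employee)  # True
```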

c) Mixed (Hybrid) fragmentation – Mixed fragmentation is a combination of vertical and horizontal fragmentation. In general, a fragment of a relation can be constructed by a SELECT-PROJECT combination of operations, ΠL(σC(R)).
  • If C = True and L ≠ ATTRS(R), we get a vertical fragment.
  • If C ≠ True and L = ATTRS(R), we get a horizontal fragment.
  • If C ≠ True and L ≠ ATTRS(R), we get a mixed fragment.
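The three cases above can be illustrated with one generic helper. The relation, predicate, and attribute lists are invented for illustration:

```python
# Minimal sketch of the SELECT-PROJECT characterization Pi_L(sigma_C(R)).
# The toy relation and the choices of C and L are illustrative assumptions.

R = [
    {"SSN": "1", "Name": "A", "Salary": 30000, "DNo": 4},
    {"SSN": "2", "Name": "B", "Salary": 40000, "DNo": 5},
]
ATTRS = ["SSN", "Name", "Salary", "DNo"]

def fragment(rel, C, L):
    """Pi_L(sigma_C(rel)): select tuples by C, then project onto L."""
    return [{a: t[a] for a in L} for t in rel if C(t)]

vertical   = fragment(R, lambda t: True,          ["SSN", "Name"])    # C = True,  L != ATTRS(R)
horizontal = fragment(R, lambda t: t["DNo"] == 5, ATTRS)              # C != True, L = ATTRS(R)
mixed      = fragment(R, lambda t: t["DNo"] == 5, ["SSN", "Salary"])  # C != True, L != ATTRS(R)

print(mixed)  # [{'SSN': '2', 'Salary': 40000}]
```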
d) Fragmentation schema – A fragmentation schema of a database is a definition of a set of fragments that includes all attributes and tuples in the database and satisfies the condition that the whole database can be reconstructed from the fragments by applying some sequence of OUTER UNION and UNION operations.
e) Allocation schema – An allocation schema describes the allocation of fragments to sites of the DDBS; hence it is a mapping that specifies for each fragment the site(s) at which it is stored.

5.3 Data Replication and Allocation
If a fragment is stored at more than one site, it is said to be replicated.
a) Fully replicated distributed database – If replication of the whole database is done at every site in the distributed system, the resulting database is called a fully replicated distributed database. This can improve availability remarkably, because the system can continue to operate as long as at least one site is up. It also improves the performance of retrieval for global queries. The disadvantage of full replication is that it can slow down update operations drastically.
b) Nonredundant allocation – In this system, each fragment is stored at exactly one site. In this case, all fragments must be disjoint except for the repetition of primary keys among vertical (or mixed) fragments.
c) Partial replication – In this system, some fragments of the database may be replicated whereas others may not. The number of copies of each fragment can range from one up to the total number of sites in the distributed system.
d) Replication schema – A description of the replication of fragments is called replication schema. Each fragment – or each copy of a fragment – must be assigned to a particular site in the distributed system. This process is called data distribution or data allocation.
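A replication-plus-allocation schema can be pictured as a mapping from each fragment to the set of sites holding a copy of it. The fragment and site names below are invented for illustration:

```python
# Illustrative sketch: a replication schema as a dict from fragment name to
# the set of sites storing a copy. All names here are hypothetical.

SITES = {"S1", "S2", "S3"}
allocation = {
    "EMP_DNo4": {"S1"},              # a single copy
    "EMP_DNo5": {"S1", "S2"},        # replicated at two sites
    "DEPT":     {"S1", "S2", "S3"},  # replicated at every site
}

def classify(schema, sites):
    """Classify a schema as fully replicated, nonredundant, or partial."""
    copies = [len(s) for s in schema.values()]
    if all(c == len(sites) for c in copies):
        return "fully replicated"
    if all(c == 1 for c in copies):
        return "nonredundant"
    return "partially replicated"

print(classify(allocation, SITES))  # partially replicated
```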
5.4 Types of Distributed Database Systems

The term distributed database management system can describe various systems that differ from one another in many respects. The main thing that all such systems have in common is the fact that data and software are distributed over multiple sites connected by some form of communication network.
            The first factor we consider is the degree of homogeneity of the DDBMS software. If all servers (or individual local DBMSs) use identical software and all users (clients) use identical software, the DDBMS is called homogeneous; otherwise, it is called heterogeneous. Another factor related to the degree of homogeneity is the degree of local autonomy. If there is no provision for the local site to function as a stand-alone DBMS, then the system has no local autonomy. On the other hand, if direct access by local transactions to a server is permitted, the system has some degree of local autonomy.
At one extreme of the autonomy spectrum, we have a DDBMS that "looks like" a centralized DBMS to the user. A single conceptual schema exists, and all access to the system is obtained through a site that is part of the DDBMS, which means that no local autonomy exists. At the other extreme we encounter a type of DDBMS called a federated DDBMS (or a multidatabase system). In such a system, each server is an independent and autonomous centralized DBMS that has its own local users, local transactions, and DBA, and hence has a very high degree of local autonomy. The term federated database system (FDBS) is used when there is some global view or schema of the federation of databases that is shared by the applications. On the other hand, a multidatabase system does not have a global schema and interactively constructs one as needed by the application. Both systems are hybrids between distributed and centralized systems, and the distinction we made between them is not strictly followed. We will refer to them as FDBSs in a generic sense.
In a heterogeneous FDBS, one server may be a relational DBMS, another a network DBMS, and a third an object or hierarchical DBMS; in such a case it is necessary to have a canonical system language and to include language translators to translate subqueries from the canonical language to the language of each server. We briefly discuss the issues affecting the design of FDBSs below.

Federated Database Management Systems Issues

The type of heterogeneity present in FDBSs may arise from several sources.
·         Differences in data models: Databases in an organization come from a variety of data models, including the relational data model, the object data model, etc. The modeling capabilities of the models vary. Hence, to deal with them uniformly via a single global schema, or to process them in a single language, is challenging. Even if two databases are both from the RDBMS environment, the same information may be represented as an attribute name, as a relation name, or as a value in different databases. This calls for an intelligent query-processing mechanism that can relate information based on metadata.
·         Differences in constraints: Constraint facilities for specification and implementation vary from system to system. There are comparable features that must be reconciled in the construction of a global schema. For example, the relationships from ER models are represented as referential integrity constraints in the relational model. Triggers may have to be used to implement certain constraints in the relational model. The global schema must also deal with potential conflicts among constraints.
·         Differences in query languages: Even with the same data model, the languages and their versions vary. For example, SQL has multiple versions like SQL-89, SQL-92 (SQL2), and SQL3, and each system has its own set of data types, comparison operators, string manipulation features, and so on.


Semantic Heterogeneity
               Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and intended use of the same or related data. Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in designing global schemas of heterogeneous databases. The design autonomy of component DBSs refers to their freedom of choosing the following design parameters, which in turn affect the eventual complexity of the FDBS:
·         The universe of discourse from which the data is drawn: For example, two customer accounts databases in the federation may be from the United States and Japan, with entirely different sets of attributes about customer accounts required by the accounting practices. Currency rate fluctuations would also present a problem. Hence, relations in these two databases which have identical names, CUSTOMER or ACCOUNT, may have some common and some entirely distinct information.

·         Representation and naming: The representation and naming of data elements and the structure of the data model may be prespecified for each local database.
·         The understanding, meaning, and subjective interpretation of data: This is a chief contributor to semantic heterogeneity.
·         Transaction and policy constraints: These deal with serializability criteria, compensating transactions, and other transaction policies.
·         Derivation of summaries: Aggregation, summarization, and other data-processing features and operations supported by the system.
5.5 Query Processing in Distributed Databases

  • Data Transfer Costs of Distributed Query Processing
               In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files that are transferred to other sites for further processing, as well as the final result files that may have to be transferred to the site where the query result is needed. Although these costs may not be very high if the sites are connected via a high-performance local area network, they become quite significant in other types of networks. Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.
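A back-of-the-envelope comparison makes the criterion concrete. All sizes below are invented for illustration: suppose an EMPLOYEE relation (10,000 tuples of 100 bytes) resides at site 1, a DEPARTMENT relation (100 tuples of 35 bytes) at site 2, and the join result (10,000 tuples of 40 bytes) is needed at site 3:

```python
# Hypothetical transfer-cost comparison for one join query; all sizes are
# assumptions, not from the text. Costs count bytes shipped over the network.

EMP_SIZE    = 10_000 * 100  # EMPLOYEE at site 1
DEPT_SIZE   = 100 * 35      # DEPARTMENT at site 2
RESULT_SIZE = 10_000 * 40   # join result, needed at site 3

# Strategy A: ship both relations to site 3 and join there.
cost_a = EMP_SIZE + DEPT_SIZE

# Strategy B: ship EMPLOYEE to site 2, join there, ship the result to site 3.
cost_b = EMP_SIZE + RESULT_SIZE

# Strategy C: ship DEPARTMENT to site 1, join there, ship the result to site 3.
cost_c = DEPT_SIZE + RESULT_SIZE

best = min((cost_a, "A"), (cost_b, "B"), (cost_c, "C"))
print(best)  # (403500, 'C')
```

With these (assumed) sizes, shipping the small relation to the large one and forwarding only the result wins by a wide margin, which is exactly the kind of choice a distributed query optimizer must make.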

  • Distributed Query Processing Using Semijoin

           The idea behind distributed query processing using the semijoin operation is to reduce the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining column of one relation R to the site where the other relation S is located; this column is then joined with S. Following that, the join attributes, along with the attributes required in the result, are projected out and shipped back to the original site and joined with R. Hence, only the joining column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the tuples in S participate in the join, this can be quite an efficient solution to minimizing data transfer.
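The three steps can be sketched as follows. The relations, attribute names, and site labels are invented for illustration; the comments mark what would be shipped across the network at each step:

```python
# Sketch of the semijoin strategy described above. R is at site A, S at
# site B; we want R joined with S on R.DNo = S.DNumber, result at site A.
# All data and names are illustrative assumptions.

R = [{"SSN": "1", "DNo": 4}, {"SSN": "2", "DNo": 5}]      # at site A
S = [{"DNumber": 5, "DName": "Research", "Budget": 100}]  # at site B

# Step 1: ship only the joining column of R to site B.        (A -> B)
join_col = {t["DNo"] for t in R}

# Step 2: at site B, keep only the S tuples that join, projected onto the
# join attribute plus the attributes needed in the result.    (B -> A)
S_reduced = [{"DNumber": s["DNumber"], "DName": s["DName"]}
             for s in S if s["DNumber"] in join_col]

# Step 3: at site A, join R with the reduced S.
result = [{**r, **s} for r in R for s in S_reduced
          if r["DNo"] == s["DNumber"]]
print(result)  # [{'SSN': '2', 'DNo': 5, 'DNumber': 5, 'DName': 'Research'}]
```

Note that the Budget attribute of S and the non-joining tuple of R never cross the network, which is the entire point of the technique.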

  • Query and Update Decomposition

  In a DDBMS with no distribution transparency, the user phrases a query directly in terms of specific fragments.
  The user must also maintain consistency of replicated data items when updating a DDBMS with no replication transparency.
On the other hand, a DDBMS that supports full distribution, fragmentation, and replication transparency allows the user to specify a query or update request on the schema just as though the DBMS were centralized. For updates, the DDBMS is responsible for maintaining consistency among replicated items by using one of the distributed concurrency control algorithms. For queries, a query decomposition module must break up or decompose a query into subqueries that can be executed at the individual sites. In addition, a strategy for combining the results of the subqueries to form the query result must be generated. Whenever the DDBMS determines that an item referenced in the query is replicated, it must choose or materialize a particular replica during query execution.
    To determine which replicas include the data items referenced in a query, the DDBMS refers to the fragmentation, replication, and distribution information stored in the DDBMS catalog. For vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal fragmentation, a condition, sometimes called a guard, is kept for each fragment. This is basically a selection condition that specifies which tuples exist in the fragment; it is called a guard because only tuples that satisfy this condition are permitted to be stored in the fragment. For mixed fragments, both the attribute list and the guard condition are kept in the catalog.
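The role of guard conditions can be sketched with a toy catalog. The fragment names, site names, and guards are hypothetical:

```python
# Illustrative sketch: a catalog entry per horizontal fragment stores the
# site and the guard (selection) condition; the DDBMS uses the guards to
# decide which fragments can hold a given tuple. All names are invented.

catalog = {
    "EMP_D4": {"site": "S1", "guard": lambda t: t["DNo"] == 4},
    "EMP_D5": {"site": "S2", "guard": lambda t: t["DNo"] == 5},
}

def fragments_for(tup):
    """Return the (fragment, site) pairs whose guard admits this tuple."""
    return [(name, meta["site"]) for name, meta in catalog.items()
            if meta["guard"](tup)]

print(fragments_for({"DNo": 5}))  # [('EMP_D5', 'S2')]
```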
 5.6 Concurrency Control and Recovery in Distributed Databases
For concurrency control and recovery purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a centralized DBMS environment. These include the following:
  • Dealing with multiple copies of the data items: The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and recovers later.
  • Failure of individual sites: The DDBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.
  • Failure of communication links: The system must be able to deal with the failure of one or more of the communication links that connect the sites. An extreme case of this problem is that network partitioning may occur. This breaks up the sites into two or more partitions, where the sites within each partition can communicate only with one another and not with sites in other partitions.
  • Distributed commit: Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol (see Chapter 21) is often used to deal with this problem.
  • Distributed deadlock: Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.
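The distributed-commit point can be illustrated with a minimal sketch of the two-phase commit idea: the coordinator collects a vote from every participating site and commits only if all vote yes. This is a deliberately simplified model; a real protocol also writes log records and handles timeouts, which are omitted here:

```python
# Minimal sketch of the two-phase commit decision rule. Each participant is
# modeled as a callable returning True (prepared to commit) or False.
# This is an illustrative simplification, not a full protocol.

def two_phase_commit(participants):
    # Phase 1 (voting): ask every site whether it can commit.
    votes = [p() for p in participants]
    # Phase 2 (decision): commit only if every site voted yes;
    # a single "no" (or failed site) aborts the whole transaction.
    return "COMMIT" if all(votes) else "ABORT"

ok   = two_phase_commit([lambda: True, lambda: True])
fail = two_phase_commit([lambda: True, lambda: False])  # one site cannot prepare
print(ok, fail)  # COMMIT ABORT
```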



















