Distributed Databases
Index
- Introduction
- Objectives
- 1.Distributed Processing
- 2.Brief History and (Tentative) Classification of Distributed DBMSs
- 3.Distributed DBMS Architectures
- 4.Distributed Database Design
- 4.1.Data Fragmentation
- 4.1.1.Horizontal Fragmentation
- 4.1.2.Derived Horizontal Fragmentation
- 4.1.3.Vertical Fragmentation
- 4.1.4.Fragmentation in Oracle
- 4.2.Data Allocation
- 5.Database Integration
- 5.1.Heterogeneities
- 5.2.Schema Matching and Schema Mapping
- 5.3.Wrappers
- 5.4.Mediators
- Summary
- Self-evaluation
- Answer key
- Glossary
- Bibliography
Introduction
Objectives
-
Explain the historical background of distributed database systems.
-
Name the main characteristics of a distributed database.
-
Enumerate the benefits of using a distributed database.
-
Name and explain the basics of different approaches for implementing heterogeneous distributed database management systems (DDBMSs).
-
Explain the three degrees of transparency that a DDBMS might provide.
-
Name different approaches to fragment data.
-
Explain the three desirable properties of data fragmentation.
-
Explain the main difference between distributed databases and parallel databases.
-
Given a database schema and its workload, decide which data fragmentation approach is most appropriate.
-
Given an already fragmented distributed database, discuss whether it fulfills the three desirable properties for data fragmentation.
-
Reconstruct global relations from their fragments (if possible).
-
Given a specific scenario, model simple distributed databases (i.e., decide how to fragment the relations and where to allocate them according to the workload).
-
Define horizontal, vertical or hybrid fragmentation of relations.
-
Name different heterogeneities among component databases.
-
Explain the wrapper concept.
-
Know the Global-as-View mediator.
1.Distributed Processing
“A distributed computing system is a number of autonomous processing elements (referred to as a computing device that can execute a program on its own), not necessarily homogeneous, that are interconnected by a computer network and that cooperate in performing their assigned tasks.”
Tamer Özsu, M.; Valduriez, P. (2011). Principles of Distributed Database Systems. New York: Springer. (pg. 2).
“Cloud computing encompasses on demand, reliable services provided over the Internet with easy access to virtually infinite computing, storage and networking resources.”
Tamer Özsu, M.; Valduriez, P. (2011). Principles of Distributed Database Systems. New York: Springer. (pg. 744).
1.1.Distributed Database Systems
“A distributed database (DDB from now on) is a collection of multiple, logically interrelated databases (known as nodes or sites) distributed over a computer network. A distributed database management system (DDBMS from now on) is thus, the software system that permits the management of the distributed database and makes the distribution transparent to the users.”
Tamer Özsu, M.; Valduriez, P. (2011). Principles of Distributed Database Systems. New York: Springer. (pg. 3)
-
The quality of being logically interrelated stresses the fact that, like a DB, a DDB is more than a simple collection of files. Files should be structured in some way, and a common access interface should be provided. For this reason, the physical distribution of data (e.g., data partitioning and replication) does matter, and it becomes one of the most challenging aspects of DDBMSs.
-
In close connection with the previous point, note that data may be distributed over large geographical areas; however, it may also be the case that the distributed data reside in the very same room. The only constraint imposed is that the communication between nodes is done through a computer network instead of shared memory or disk space. Note that this excludes parallel databases, since they do not meet this requirement.
-
Another common mistake is to think that, despite communicating through a computer network, the database resides in only one node. If this were so, it would not be very different from a centralized database and would not pose new challenges. On the contrary, DDBMSs are a different matter, providing an environment where data are distributed among a number of sites (see Fig. 1).
-
Finally, making distribution transparent to the user is, indeed, a huge accomplishment with many implications. Transparency refers to separation of a system’s higher-level semantics from lower-level implementation issues. Thus, the system must hide the implementation details. For example, the user must be able to execute distributed queries without knowing where data are physically stored; hence, data distribution and replication must be an internal issue for the DDBMS. Similarly, the DDBMS must ensure safety properties at every moment. Examples are the ACID transaction properties, which are obviously affected by distribution; dealing with update transactions, which must guarantee data consistency when replication happens (i.e., synchronization between copies); and coping with node failures to guarantee system availability.
1.2.Characteristics of Distributed Database Systems
1.2.1.Distribution Transparency
-
Data independence: This is fundamental to any form of transparency, and is also common to centralized DBMSs. Basically, data definition occurs at two different levels: logical (schema definition) and physical. Logical data independence refers to user applications’ immunity to changes in the logical structure (i.e., the schema) of the database, whereas physical data independence hides storage details from the user.
-
Network transparency: In a centralized database, the only resource to be shielded is data. In a DDBMS, however, there is a second resource to be likewise protected: the network. Preferably, the user should be protected from the operation details of the network, even hiding its existence whenever possible. Two kinds of transparency are identified at this level: location and naming transparency. The former underlines the fact that any task performed is independent of both the location and system where the operation must be performed. The latter refers to the fact that each object has a unique name in the database. In the absence of this, the user is required to embed the location name with the object name. Location transparency is needed in order to have naming transparency. In our example, naming transparency means that we can refer to the employee table, instead of Barcelona:employee (which refers to the fragment in Barcelona).
-
Replication transparency: As briefly discussed at the beginning of this section, it may be advantageous to replicate data over different nodes. The main reason to do so is performance, since we may avoid network overhead by accessing data locally (i.e., replication increases locality of references) and may use replicas to perform certain actions in a more efficient way, using other copies. Furthermore, it also provides robustness: if one node fails, we still have the other copies to access. However, the more we replicate, the more difficult it becomes to deal with update transactions and to keep all replicas consistent. Ideally, all these issues should be transparent to the users, and they should act as if a single copy of data were available (thus, as if one object were available instead of many). In our example, if the salary table is replicated at each node, the DDBMS itself will be responsible for maintaining consistency between the replicas and for choosing which replicas to use for each action. All these issues are dealt with in detail in section 4.
-
Fragmentation transparency: In our example we discussed the usefulness of having different fragments for each relation. Thus, each fragment would be a different object. Again, the main reason for fragmentation is reliability, availability and performance, where it can be seen as a way to diminish the negative aspects of replication. When relations are fragmented, we face the problem of handling queries over relations to be executed over relation fragments. This typically entails a translation from the global query into fragment queries. If this transparency level is provided, the translation is performed by the DDBMS. For example, if we want to see all the employees in our supposed company, the user will query all the tuples in the employee relation. The DDBMS will then be responsible for breaking this global query into fragment queries to be posed at each node, and to compose the global result from the partial results obtained. To do so, the DDBMS must know the criteria used to fragment the relation, but this will be further explained in section 4.
“Applications coded with transparent access to geographically distributed databases have: poor manageability, poor modularity and poor message performance.”
Gray, J. (1989). Transparency in its place - the case against transparent access to geographically distributed data. Cupertino: Technical Report TR89.1, Tandem Computers, Inc. (pg. 11).
1.2.2.Improved Performance
-
Each node handles its own portion of the database, decreasing contention (in the sense of blocking) for CPU and I/O operations.
-
Localization reduces the number of remote accesses to data, which, in turn, implies fewer propagation delays.
1.2.3.Easier System Expansion
1.3.Challenges Associated with Distributed Database Systems
2.Brief History and (Tentative) Classification of Distributed DBMSs
-
Tightly coupled federated databases require the definition of a global integration schema that contains mappings to all participating database schemas. The federated database becomes a central server management system on top of the participating autonomous databases. Note that such a low degree of autonomy can make entering or leaving the federation problematic for the participants.
-
As the number of databases to integrate increases, it becomes very difficult or impossible to define a global integration schema over the large number of autonomous databases. Multi-databases provide no global conceptual schema and instead a multi-database query language allows specification of queries that search through many participating databases. Consequently, a node can easily enter or leave the DDBMS.
-
Loosely coupled federated databases provide a middle ground between a single integration schema (tightly coupled federated databases) and no schema at all (multi-databases). Instead, the designer can define views that combine and reconcile data from different data sources. You can think of these views as similar to relational views. Such views, however, require a query language that can express queries over several databases, i.e., a multi-database query language. Thus, on one hand they provide a certain degree of integration (views) and, on the other hand, they use an ad hoc query language to overcome heterogeneities.
| | Autonomy | Central Schema | Query Transparency | Update Transparency |
|---|---|---|---|---|
| Homogeneous DBs | No | Yes | Yes | Yes |
| Tightly Coupled Federated DBs | Low | Yes | Yes | Limited |
| Loosely Coupled Federated DBs | Medium | No | Yes | Limited |
| Multi-databases | High | No | No | No |
-
Autonomy: This column focuses on the autonomy of nodes in the DDBMS. As previously discussed, nodes in homogeneous databases are not autonomous by definition, whereas heterogeneous solutions are built on top of (more or less) autonomous databases.
-
Central Schema: This column focuses on the presence of a central schema in the system. Homogeneous databases and tightly coupled federated databases rely on central schemas to tackle the integration issue, whereas the rest of heterogeneous solutions relax this constraint to deal with large numbers of nodes.
-
Query Transparency: Query transparency refers to whether the distribution of data is reflected in how users pose queries – in other words, whether, from the user perspective, the system gives the impression of a single database. Query transparency is primarily achieved when a single schema is provided: as views in loosely coupled federated databases, as a global schema for tightly coupled federated databases and as a single logical schema for homogeneous distributed databases.
-
Update Transparency: This transparency level only affects DBAs; it refers to how updates are processed internally and if distribution is taken into account.
In general, it is difficult to achieve consistent updating for heterogeneous databases since the nodes are autonomous and the integration layer may not have access to the local transaction managers. However, tightly and loosely coupled federated databases may partially achieve it. In the first case, some updates in the global integration schema might be propagated to the underlying databases. In the second case, they define views over data, and the problem reduces to the update-through-views problem (and is thus limited by it).
3.Distributed DBMS Architectures
3.1.Architectures for Homogeneous DDBMSs
3.1.1.The ANSI/SPARC Schema Architecture
3.1.2.Extending a Centralized DBMS Functional Architecture
-
The global query manager contains the view manager, the security manager, the constraint checker and the query's semantic and syntactic optimizers. All of these behave as in a centralized DBMS, except for the syntactic optimizer, which is extended to consider data location (by querying the global data catalog). Finally, the physical optimizer is replaced, at the global level, by the global execution manager. This new extension is responsible for exploiting the metadata stored in the global catalog, for deciding issues such as which node executes what (according to replicas and fragments available and communication cost over the network) and for certain execution strategies that are relevant for distributed query execution, such as minimizing the size of intermediate results to be sent over the network (e.g., deciding among join strategies) and exploiting parallelism. Finally, it inserts communication primitives in the execution plan.
-
The global scheduler then receives the global execution plan produced in the previous module and distributes tasks between the available sites. It will be responsible for building up the final result from all the subqueries that were executed in a distributed context. The global scheduler also applies distributed algorithms to guarantee the ACID transaction properties.
-
Next, each node receives its local execution plan and, by querying its catalog, generates a local access plan by means of the local query manager. The local query manager behaves similarly to a centralized query manager and, among other duties, decides which data structures must be used to optimize data retrieval. Subsequently, the data flow goes through the recovery and buffer managers, as in a centralized DBMS.
3.2.Architectures for Heterogeneous DDBMSs
3.2.1.Five-Level Schema Architecture
3.2.2.Wrapper-Mediator Functional Architecture
3.2.3.Peer-to-Peer Functional Architecture
-
Mass distribution, since they are composed of thousands of nodes (peers) with a wide geographical distribution and have the possibility of forming groups in certain places.
-
The inherent heterogeneity of the peers and their total individual autonomy.
-
System volatility, since each peer is usually a personal computer that enters and leaves the system at will, making data management more complex.
4.Distributed Database Design
4.1.Data Fragmentation
-
Completeness: Given a relation R and a set of fragments, any data item (either a tuple or a set of attributes) of R can be found in at least one fragment. Thus, data are not lost when fragmenting.
-
Disjointness: Given a relation R and a set of fragments, any data item placed in a fragment cannot be found in any of the other fragments. Remember that data replication must be considered a posteriori, in the allocation stage, as a task independent of data fragmentation.
-
Reconstruction: Given a relation R and a set of fragments, the original relation can always be reconstructed from the fragments by means of relational algebraic operators.
4.1.1.Horizontal Fragmentation
Q2: SELECT * FROM project WHERE city = 'Glasgow' AND budget > 10000
Q3: SELECT * FROM project WHERE city = 'Frankfurt'
-
For equalities, we need to add a predicate for each value in the domain. Thus, for the city attribute, we need to add city = 'Portland' and city = 'Hong Kong'.
-
For ranges, we need to complete the range. In the case of the budget attribute, we need to add budget <= 10000.
-
Completeness: The fragmentation predicates must ensure that every tuple is assigned to at least one fragment. As long as the fragmentation predicates are complete, the final fragmentation is also guaranteed to be complete.
-
Disjointness: The fragmentation predicates must be mutually exclusive. In other words, one tuple must be placed in at most one fragment. This is usually referred to as the minimality property for horizontal fragments.
-
Reconstruction: The union of all the fragments must constitute the original relation. Thus, R = F1 ∪ F2 ∪ … ∪ Fn, where the Fi are the horizontal fragments of R.
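The three properties above can be checked mechanically on a toy example. The sketch below (illustrative names and data, not from the text) horizontally fragments a set of `project` tuples using a completed set of predicates on `city` and `budget`, then verifies completeness, disjointness and reconstruction by union:

```python
# Toy relation: a few `project` tuples (hypothetical data).
projects = [
    {"pno": 1, "city": "Glasgow", "budget": 20000},
    {"pno": 2, "city": "Frankfurt", "budget": 5000},
    {"pno": 3, "city": "Portland", "budget": 9000},
]

# Completed, mutually exclusive fragmentation predicates.
predicates = [
    lambda t: t["city"] == "Glasgow" and t["budget"] > 10000,
    lambda t: t["city"] == "Glasgow" and t["budget"] <= 10000,
    lambda t: t["city"] == "Frankfurt",
    lambda t: t["city"] in ("Portland", "Hong Kong"),
]

fragments = [[t for t in projects if p(t)] for p in predicates]

# Completeness: every tuple lands in at least one fragment.
assert all(any(p(t) for p in predicates) for t in projects)
# Disjointness: no tuple satisfies two predicates.
assert all(sum(bool(p(t)) for p in predicates) <= 1 for t in projects)
# Reconstruction: the union of the fragments is the original relation.
reconstructed = [t for f in fragments for t in f]
assert sorted(t["pno"] for t in reconstructed) == [1, 2, 3]
```

If a predicate were dropped (say, the one covering Frankfurt), the completeness assertion would fail: the corresponding tuples would be lost when fragmenting.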
4.1.2.Derived Horizontal Fragmentation
-
The fragmentation most used by users/applications (i.e., the subset closest to what users and applications actually access),
-
The fragmentation that maximizes parallel execution of the queries.
-
Completeness: The relationship used to semijoin the two relations must enforce the referential integrity constraint.
-
Disjointness: The join attribute must be the owner’s key.
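A derived horizontal fragmentation can be sketched as a semijoin of the member relation with each fragment of its owner. The example below uses hypothetical relations (`project` as owner, `assignment` as member, joined on `pno`); the names and data are illustrative, not from the text:

```python
# Owner fragments (e.g., produced by a horizontal fragmentation of project).
project_f1 = [{"pno": 1}, {"pno": 2}]
project_f2 = [{"pno": 3}]

# Member relation referencing the owner via the FK pno.
assignments = [
    {"empId": 10, "pno": 1},
    {"empId": 11, "pno": 3},
    {"empId": 12, "pno": 2},
]

def semijoin(member, owner_fragment, key="pno"):
    # Keep the member tuples whose FK matches some tuple in the owner fragment.
    keys = {t[key] for t in owner_fragment}
    return [t for t in member if t[key] in keys]

asg_f1 = semijoin(assignments, project_f1)
asg_f2 = semijoin(assignments, project_f2)

# Referential integrity guarantees completeness; pno being the owner's key
# guarantees disjointness, so no assignment is lost or duplicated.
assert len(asg_f1) + len(asg_f2) == len(assignments)
```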
4.1.3.Vertical Fragmentation
-
Completeness: The union of the projected attributes in each fragment must produce the original relation.
-
Disjointness: A given attribute (except for the primary key) can be used to produce only one fragment. In other words, every attribute (but the primary key) appears in one and only one fragment.
-
Reconstruction: To guarantee reconstruction, the primary key must be replicated in each fragment. Intuitively, this is needed to keep track of the original tuple and be able to reconstruct it by means of joins through the replicated PK. Thus, R = F1 ⋈ F2 ⋈ … ⋈ Fn, where the natural joins are performed on the replicated primary key.
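The role of the replicated primary key can be illustrated with a small sketch (hypothetical data): each vertical fragment projects a subset of attributes plus the PK, and the original tuples are rebuilt by joining the fragments on that key:

```python
# Toy relation to fragment vertically (illustrative data).
projects = [
    {"pno": 1, "name": "Alpha", "city": "Glasgow", "budget": 20000},
    {"pno": 2, "name": "Beta", "city": "Frankfurt", "budget": 5000},
]

# Each fragment projects some attributes, always replicating the PK pno.
f1 = [{k: t[k] for k in ("pno", "name", "city")} for t in projects]
f2 = [{k: t[k] for k in ("pno", "budget")} for t in projects]

# Reconstruction: natural join of the fragments on the replicated PK.
by_pno = {t["pno"]: t for t in f2}
reconstructed = [{**t, **by_pno[t["pno"]]} for t in f1]
assert reconstructed == projects
```

Dropping `pno` from either fragment would make the join, and hence the reconstruction, impossible.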
4.1.4.Fragmentation in Oracle
CREATE TABLE project (
    pno NUMBER(8,0) PRIMARY KEY,
    name VARCHAR(20),
    city VARCHAR(10),
    country VARCHAR(8),
    budget NUMBER(7,2),
    category VARCHAR(10),
    productivityRatio NUMBER(3,2),
    income NUMBER(7,2))
PARTITION BY LIST (city) (
    PARTITION node1 VALUES ('BARCELONA'),
    PARTITION node2 VALUES ('GLASGOW'),
    PARTITION node3 VALUES ('HONG KONG'),
    PARTITION node4 VALUES ('FRANKFURT'),
    PARTITION node5 VALUES ('PORTLAND'));
CREATE TABLE project (...)
PARTITION BY RANGE (budget) (
    PARTITION node1 VALUES LESS THAN (10001),
    PARTITION node2 VALUES LESS THAN (450001));
CREATE TABLE project (...)
PARTITION BY HASH (pno)
PARTITIONS 4;
CREATE TABLE project (...)
PARTITION BY RANGE (budget)
SUBPARTITION BY HASH (pno) SUBPARTITIONS 4 (
    PARTITION node1 VALUES LESS THAN (10001),
    PARTITION node2 VALUES LESS THAN (450001));
CREATE TYPE info AS OBJECT (
    name VARCHAR(20),
    city VARCHAR(10),
    country VARCHAR(8));
CREATE TYPE analyticalItem AS OBJECT (
    budget NUMBER(7,2),
    category VARCHAR(10),
    productivityRatio NUMBER(3,2),
    income NUMBER(7,2));
CREATE TYPE generalInfo AS TABLE OF info;
CREATE TYPE analyticalInfo AS TABLE OF analyticalItem;
CREATE TABLE project (
    pno NUMBER(8,0) PRIMARY KEY,
    infoData generalInfo,
    analyticalData analyticalInfo)
NESTED TABLE infoData STORE AS infoDataNT
NESTED TABLE analyticalData STORE AS analyticalDataNT;
4.2.Data Allocation
-
Minimal cost: A function that results from computing the cost of storing each fragment Fi at a certain node Ni, the cost of querying Fi at Ni, the cost of updating each fragment Fi at all sites where it is replicated, and the cost of data communication. The allocation problem places each fragment at one or several sites in order to minimize this combined function.
-
Performance: In this case, we aim either to optimize system response time (given a set of queries) or to maximize throughput at each node.
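To make the minimal-cost criterion concrete, the following is a grossly simplified, single-copy sketch of the allocation problem (node names, costs and workload are all assumptions): each fragment is placed at the node that minimizes its storage cost plus the cost of shipping remote accesses over the network. Real formulations also model replication and updates and are NP-hard; this greedy version is only illustrative.

```python
nodes = ["BCN", "GLA"]
fragments = ["F1", "F2"]

# Assumed storage cost of fragment f at node n.
storage_cost = {("F1", "BCN"): 1, ("F1", "GLA"): 2,
                ("F2", "BCN"): 3, ("F2", "GLA"): 1}
# Assumed workload: accesses[f][n] = how often node n queries fragment f.
accesses = {"F1": {"BCN": 100, "GLA": 10},
            "F2": {"BCN": 5, "GLA": 80}}
comm_cost = 0.1  # cost per remote access shipped over the network

def total_cost(frag, node):
    # Accesses from any other node must travel over the network.
    remote = sum(a for n, a in accesses[frag].items() if n != node)
    return storage_cost[(frag, node)] + comm_cost * remote

# Greedily place each fragment at its cheapest node.
allocation = {f: min(nodes, key=lambda n, f=f: total_cost(f, n))
              for f in fragments}
assert allocation == {"F1": "BCN", "F2": "GLA"}
```

As expected, each fragment ends up at the node that queries it most, maximizing locality of references.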
5.Database Integration
5.1.Heterogeneities
5.2.Schema Matching and Schema Mapping
5.3.Wrappers
Mediator query:
    SELECT * FROM med_car WHERE color = '$c';
=>
Wrapper translation:
    SELECT serial_n, make, model, auto_transmission, color FROM vehicle WHERE color = '$c';
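A wrapper can be thought of as a query-rewriting function: it accepts a query over the mediated relation (`med_car` in the example) and emits the equivalent query over the source schema (`vehicle`). The sketch below is a minimal illustration of that idea, not a real wrapper implementation:

```python
def wrap_med_car_query(color):
    # The mediated schema med_car exposes only a subset of vehicle's
    # attributes, so the wrapper projects exactly those columns.
    return ("SELECT serial_n, make, model, auto_transmission, color "
            "FROM vehicle WHERE color = '%s';" % color)

q = wrap_med_car_query("red")
assert q.startswith("SELECT serial_n")
assert "FROM vehicle" in q and "color = 'red'" in q
```

A production wrapper would also translate the result tuples back into the mediated schema and handle source-specific access limitations.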
5.4.Mediators
-
f (free), the attribute can be specified or not, as required.
-
b (bound), we must specify a value for the attribute (any value is allowed).
-
u (unspecified), no value is specified for the attribute.
-
c[S] (choice from set S), we must specify a value among those belonging to set S.
-
o[S] (optional from set S), we can choose whether to specify the value or not, but if we do, it must belong to set S.
-
To query all of the details of a vehicle with a given serial number without the serial number forming part of the result, the vehicle relation of CDB1 should have the adornment b'uuuuu. This adornment is expressed as: vehicle b'uuuuu(serial_n, make, model, auto_transmission, GPS, color).
-
To query all the details of a particular make and model, choosing the color and optionally setting whether the vehicle has automatic transmission and GPS, the vehicle relation of CDB1 must have the adornment ubbo[yes, no]o[yes, no]b. This adornment is expressed as: vehicle ubbo[yes, no]o[yes, no]b(serial_n, make, model, auto_transmission, GPS, color).
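The adornment rules above can be turned into a small validity check. The sketch below (an assumption of how such a check could look, not a standard API) tests whether a set of supplied attribute bindings satisfies a given adornment, encoding `f`, `b`, `u`, `c[S]` and `o[S]` per attribute:

```python
def satisfies(adornment, attrs, bindings):
    """Return True iff the supplied bindings respect the adornment."""
    for attr, a in zip(attrs, adornment):
        v = bindings.get(attr)
        if a == 'f':                       # free: value optional
            continue
        if a == 'b' and v is None:         # bound: value required
            return False
        if a == 'u' and v is not None:     # unspecified: no value allowed
            return False
        if isinstance(a, tuple):
            kind, allowed = a
            if kind == 'c' and v not in allowed:          # choice from S
                return False
            if kind == 'o' and v is not None and v not in allowed:  # optional from S
                return False
    return True

attrs = ("serial_n", "make", "model", "auto_transmission", "GPS", "color")
# Adornment u b b o[yes,no] o[yes,no] b for the vehicle relation:
ad = ('u', 'b', 'b', ('o', {'yes', 'no'}), ('o', {'yes', 'no'}), 'b')

assert satisfies(ad, attrs, {"make": "Seat", "model": "Ibiza", "color": "red"})
assert not satisfies(ad, attrs, {"make": "Seat"})  # model and color are bound
```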
| R(w, x) | S(x, y) | T(y, z) |
|---|---|---|
| (1, 2) | (2, 4) | (4, 6) |
| (1, 3) | (3, 5) | (5, 7) |
| (1, 4) | | (5, 8) |
Summary
Self-evaluation
-
Enumerate and briefly describe the main approaches that exist for handling heterogeneous distributed database systems.
-
Briefly describe the main objectives of the allocation stage when designing a distributed database from scratch. What are the main difficulties the designer will face at this stage?
-
Enumerate and briefly describe the main transparency layers that a distributed database management system can provide.
-
What kind of fragmentation strategy has been applied?
-
Are these fragments semantically correct? Justify your answer.
-
Briefly explain what fragmentation strategy they have applied. Justify your answer.
-
Is this fragmentation strategy complete and disjoint? Can we reconstruct the global relations?
Answer key
-
Traditionally, heterogeneous DDBMSs have been classified as (tightly or loosely coupled) federated databases, multi-databases, and peer-to-peer systems. Tightly coupled federated databases rely on a global integration schema sitting on top of the pre-existing nodes. Multi-databases provide no global conceptual schema, but offer an ad hoc multi-database language to query the participating databases. As a middle ground, loosely coupled federated databases provide views over the participating nodes, which provide integration. However, these views are defined by means of a multi-database language. Unlike these approaches, peer-to-peer systems relax query consistency, but provide query transparency without using a central schema. Queries simply hop from one node to another until a time-out is triggered, after which no further nodes are queried.
-
Once we have fragmented our relations, the allocation stage seeks to allocate each fragment to a node, in such a way that a certain optimization criterion is met. Whether to replicate fragments in several nodes is a decision that should be made at this stage. Normally, the optimization criterion is either to minimize the system costs (e.g., storing and updating fragments) or to boost performance (i.e., response time). However, too many factors must be taken into account to cope with these criteria and, unfortunately, their optimization has proved to be an NP-hard problem. Currently, DDBMSs and DBAs usually rely on simple optimization algorithms to solve this problem.
-
To fully achieve distribution transparency we must guarantee data independence (between the logical and physical schema), network transparency (which includes naming and location transparency and seeks to hide the existence of the network from the user), replication transparency (hiding the existence of replicas) and fragmentation transparency (hiding the existence of fragments). Finally, update transparency, closely related to the replication transparency layer, seeks to synchronize replicas transparently to the user.
-
It is a hybrid fragmentation strategy. Specifically, a horizontal strategy nested in a vertical fragmentation.
-
No, it is not. The vertical fragmentation is correct (it is complete, disjoint and can be reconstructed) but the horizontal fragmentation strategy is not (it is not complete, since the equalities are not considered in any of the fragments). Thus, the hybrid fragmentation is not correct either, since the nested fragmentation strategies must be correct to produce a correct hybrid approach.
-
They have applied a vertical fragmentation over Kids, a horizontal fragmentation over Toys and a derived horizontal fragmentation over Requests.
-
Kids has been fragmented into two subsets by projecting its attributes. The first subset consists of {kidId, name} and the second one of {kidId, address, age}.
-
Toys has been fragmented into two subsets by applying two selections. The fragment predicates are: (price < 150) and (price >= 150).
-
Requests has been fragmented into two subsets by considering the FK-PK relationship between requests and toys. To do so, a semijoin has been performed to decide how to fragment requests according to toys.
-
The fragmentation strategy chosen is complete, disjoint and the original relations can be reconstructed.
-
Kids is complete since all the table attributes are considered in at least one of the fragments. It is disjoint because no attribute (besides the primary key) has been projected in more than one fragment. Finally, it can be reconstructed because the primary key kidId has been replicated in each fragment.
-
Toys is complete since all the domain values of the price attribute are considered in the ranges used, and it is disjoint since the ranges are mutually exclusive. Finally, it can be reconstructed by uniting both fragments (since they are disjoint and complete).
-
Requests is complete since the relationship used to semijoin both relations has been implemented as a PK-FK constraint. It is disjoint because the semijoin involves the owner’s key (i.e., toyId). Finally, the original relation can be reconstructed by uniting the fragments, which are known to be, as a whole, complete and disjoint.
Glossary
- Allocation (or data allocation)
- This problem appears in top-down design of distributed databases. Given a set of fragments, the data allocation problem consists of allocating data at the available nodes in such a way that some optimization criterion is met.
- ANSI/SPARC architecture
- The reference schema architecture for databases.
- CDB
- Component Database.
- Data locality
- In relation to distributed systems, it refers to placing data where it is needed to minimize communication overhead, and consequently, ensure greater efficiency and better performance.
- DBA
- Database Administrator.
- DDB
- Distributed database.
- DDBMS
- Distributed database management system.
- Federated databases
- require the definition of a global integration schema that contains mappings to the participating databases’ schemas. The federated database becomes a central server on top of the participating autonomous databases.
- Fragmentation
- The problem of breaking a relation into smaller pieces to be distributed over a network.
- Mediators
- offer a solution to deal with representation and abstraction problems and exchange objects across multiple, heterogeneous information sources.
- Partitioning
- Essentially following the same principle as fragmentation, it differs in that resulting fragments continue to be local and are not spread over a network. Partitioning can be used for many purposes, but mainly to benefit from parallelism and to implement privacy.
- Peer-to-peer systems
- are used for efficient, scalable sharing of individual peers’ resources among the participating peers.
- Replication
- The same fragment is allocated to several nodes. Used primarily to improve reliability and efficiency of read-only queries.
- Scalability
- In distributed systems, it refers to system expansion.
- Semantic heterogeneity
- refers to a problem that arises when the data to be integrated have been developed by different groups for different purposes.
- Transparency
- Transparency refers to separation of the higher-level semantics of a system from lower-level implementation issues.
- Wrappers
- are programs that extract data from information sources with changing content and translate the data into a different format.