Distributed Databases in 5 mins

A distributed database stores its data across multiple nodes, often in different physical locations, while presenting it to applications as a single logical database.

Main Features

Transparency: Users do not need to know where data is stored, as the system conceals this detail.
Fault Tolerance: The system keeps functioning when some components fail, because copies of the data are kept on more than one node.
Concurrency: Multiple transactions can run at the same time, with concurrency control preventing them from interfering with one another.
Scalability: The system can easily expand to handle increased data and user loads by adding more resources.

Primary Types of Scalability

Horizontal Scalability: Adding more servers or nodes to distribute the workload.
Vertical Scalability: Upgrading existing hardware, such as increasing memory or processing power.

Benefits of Distributed Databases

High Availability: Data replication ensures that systems remain operational despite node failures.
Improved Performance: Queries can be processed in parallel across nodes instead of on a single server.
Geographic Distribution: Data is located near users to minimize delays.
Load Balancing: Distributing the workload evenly prevents bottlenecks.

Challenges in Distributed Databases

Consistency: Keeping data copies synchronized can be challenging during high traffic or network issues.
Network Latency: Communication delays can slow down transactions.
Fault Detection and Recovery: Identifying and recovering from node failures requires advanced techniques.
Complexity: Managing a distributed system is more complex than handling a centralized database.

Types of Partitioning

Horizontal Partitioning

This method groups rows based on specific criteria, such as date ranges or geographic regions. It helps speed up queries by dividing rows into smaller, manageable parts, all maintaining the same column structure.

Use Case: In a retail database, customer orders can be stored in separate tables based on geographic regions, such as North, South, East, and West, to improve query performance and manageability of regional data.
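
As a rough illustration of the use case above, the following Python sketch groups order rows into one partition per region. The row layout, the sample values, and the function name partition_by_region are assumptions made for this example, not part of any particular database.

```python
from collections import defaultdict

# Hypothetical order rows; every row has the same columns.
orders = [
    {"order_id": 1, "region": "North", "amount": 120.0},
    {"order_id": 2, "region": "South", "amount": 75.5},
    {"order_id": 3, "region": "North", "amount": 42.0},
    {"order_id": 4, "region": "East", "amount": 310.0},
]

def partition_by_region(rows):
    """Group whole rows into one partition per region (columns are unchanged)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["region"]].append(row)
    return dict(partitions)

partitions = partition_by_region(orders)

# A regional query now scans only its own partition.
north_total = sum(row["amount"] for row in partitions["North"])
print(sorted(partitions))  # ['East', 'North', 'South']
print(north_total)         # 162.0
```

Every partition keeps the full set of columns; only the rows are divided.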

Vertical Partitioning

This method divides a table into smaller tables based on columns instead of rows. Each smaller table holds a set of columns that are often used together, thereby improving query efficiency by reducing the number of columns accessed.

Use Case: In a customer database, contact information (name, phone, email) can be kept in one table for frequent access, while detailed billing or other sensitive data is stored in another table. This speeds up common queries and allows access to the sensitive columns to be restricted separately.
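
The same idea can be sketched in Python by splitting each customer record into two column groups that share a primary key. The column names, the sample data, and the helpers split_columns and full_record are hypothetical and only serve the illustration.

```python
# Hypothetical customer rows with both contact and billing columns.
customers = [
    {"customer_id": 1, "name": "Ada", "phone": "555-0100",
     "email": "ada@example.com", "billing_address": "1 Main St", "card_last4": "4242"},
    {"customer_id": 2, "name": "Lin", "phone": "555-0101",
     "email": "lin@example.com", "billing_address": "9 Oak Ave", "card_last4": "1881"},
]

CONTACT_COLUMNS = ("name", "phone", "email")
BILLING_COLUMNS = ("billing_address", "card_last4")

def split_columns(rows, key, columns):
    """Build a table keyed by `key` that keeps only the given columns."""
    return {row[key]: {col: row[col] for col in columns} for row in rows}

contact = split_columns(customers, "customer_id", CONTACT_COLUMNS)
billing = split_columns(customers, "customer_id", BILLING_COLUMNS)

def full_record(customer_id):
    """Rejoin both partitions on the shared primary key when all columns are needed."""
    return {"customer_id": customer_id, **contact[customer_id], **billing[customer_id]}

print(contact[1])  # only the contact columns are read for a contact lookup
```

The trade-off is that any query needing both column groups has to rejoin them on the key, as full_record does.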

Sharding in Databases

Sharding involves splitting a large database into smaller, independent parts called shards. Each shard is a separate database containing a subset of the data.

Benefits

Improved Performance: Faster query execution by accessing only the relevant shards.
Horizontal Scalability: Easily add or remove shards to accommodate data growth.
High Availability: A failure in one shard does not take down the others; only the data on the failed shard is affected.
Efficient Resource Utilization: The workload is spread across multiple servers.

How It Works

Data is distributed based on a sharding key. Each shard manages a specific subset of data, and the system routes queries to the appropriate shard(s) using the sharding key.
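
One simple way to route by sharding key is to hash the key and take the result modulo the number of shards. The sketch below assumes that scheme; the shard names, the in-memory storage dictionary, and the put/get helpers are made up for illustration, and real systems may use range-based or directory-based routing instead.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names

def shard_for(key, shards=SHARDS):
    """Map a sharding key to a shard using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not route consistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

# Each shard is modeled as its own independent key-value store.
storage = {name: {} for name in SHARDS}

def put(key, value):
    storage[shard_for(key)][key] = value

def get(key):
    # Only the one shard that owns the key is consulted.
    return storage[shard_for(key)].get(key)

put("device-17", {"temperature": 21.5})
print(get("device-17"))  # {'temperature': 21.5}
```

With plain modulo hashing, changing the number of shards remaps most keys. Schemes such as consistent hashing are commonly used to reduce that data movement.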

Key Characteristics

Independent Shards: Each shard has its own data and resources.
Scalability: Shards can be dynamically added or removed.
Distributed Queries: Routing is based on the sharding key.
Fault Tolerance: Failures are isolated to individual shards, and each shard can have its own failover strategy.

Choosing a Sharding Key

A good sharding key should:

Distribute data evenly to prevent hotspots.
Support common query patterns for efficient routing.
Minimize the need for rebalancing or data movement.

Examples of Sharding Keys:

RegionCode — to divide data by geographic regions.
DeviceID — in IoT applications to separate device-specific data.
Date — useful for handling records that are time-sensitive.
ProductCategory — in inventory management systems to categorize products effectively.
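
To compare candidate keys for even distribution, one rough check is to count how many records fall under each key value and compare the largest group to the average. The skew function and the sample records below are assumptions for illustration; a real evaluation would also look at query patterns and growth over time.

```python
from collections import Counter

def skew(records, key_field):
    """Ratio of the largest key group to the mean group size.

    1.0 means perfectly even; larger values point to a potential hotspot.
    """
    counts = Counter(record[key_field] for record in records)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Hypothetical records: three of four devices report from the same region.
records = [
    {"RegionCode": "North", "DeviceID": "d1"},
    {"RegionCode": "North", "DeviceID": "d2"},
    {"RegionCode": "North", "DeviceID": "d3"},
    {"RegionCode": "South", "DeviceID": "d4"},
]

print(skew(records, "RegionCode"))  # 1.5
print(skew(records, "DeviceID"))    # 1.0
```

In this sample, DeviceID spreads the records more evenly than RegionCode, although RegionCode may still be the better choice if most queries filter by region.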

Distributed Query Processing

Distributed query processing executes a query across multiple nodes, fetching and combining their partial results into a single unified result.

Goals:

Minimize data transfer between nodes
Optimize query speed through parallel processing
Ensure data consistency and correctness

Challenges:

Data localization and selection of appropriate nodes
Network latency and the associated data transfer costs
Complex query optimization processes
Issues related to fault tolerance and data consistency

Key Steps

Parsing & Validation — check the query's syntax and identify where the data is located.
Decomposition — break down the query into sub-queries that can be handled by specific nodes.
Optimization — select the most efficient execution plan.
Execution — execute the sub-queries on their respective nodes.
Aggregation — merge the results to produce the final output.
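
The decomposition, execution, and aggregation steps can be sketched for an average query as follows. Parsing and optimization are omitted, and the node names, the data, and all function names are hypothetical. The point of the example is that each node returns a (sum, count) pair, because averaging the per-node averages would give the wrong answer when nodes hold different numbers of rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nodes, each holding a fragment of the order amounts.
NODES = {
    "node-a": [10.0, 20.0, 30.0],
    "node-b": [40.0],
}

def decompose(node_names):
    """Decomposition: one sub-query per node that holds relevant data."""
    return [(name, "SUM_AND_COUNT") for name in node_names]

def execute(sub_query):
    """Execution: run the sub-query against the node's local data."""
    node, _operation = sub_query
    amounts = NODES[node]
    return sum(amounts), len(amounts)

def aggregate(partials):
    """Aggregation: merge partial (sum, count) pairs into a global average."""
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count

def average_order_amount():
    sub_queries = decompose(NODES)
    with ThreadPoolExecutor() as pool:  # sub-queries run in parallel
        partials = list(pool.map(execute, sub_queries))
    return aggregate(partials)

print(average_order_amount())  # 25.0
```

Averaging the two per-node averages (20.0 and 40.0) would give 30.0, which is why the partial results carry counts as well as sums.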

Techniques

Query Fragmentation — break a large query into smaller parts to run in parallel, reducing execution time.
Push-Down Predicates — apply filters close to the data source to minimize network data transfer.
Local Joins — perform joins on the same node to avoid data transfer and lower latency.
Distributed Joins — execute joins across multiple nodes, transferring only necessary data.
Data Sharding — distribute data across nodes to reduce cross-node queries and improve access speed.
Caching — store frequently used results locally to avoid repeating expensive computations.
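
For distributed joins, one well-known way to transfer only the necessary data is a semi-join reduction: the node holding one table sends just its distinct join keys, and the other node replies with only the matching rows. The text does not name this technique, so treat the sketch below as one possible approach; the two nodes are simulated in a single process and all data and function names are hypothetical.

```python
# Node A holds orders; node B holds the product catalog (hypothetical data).
ORDERS_ON_A = [
    {"order_id": 1, "product_id": "p1", "qty": 2},
    {"order_id": 2, "product_id": "p3", "qty": 1},
    {"order_id": 3, "product_id": "p1", "qty": 5},
]
PRODUCTS_ON_B = {
    "p1": {"name": "Laptop"},
    "p2": {"name": "Phone"},
    "p3": {"name": "Camera"},
    "p4": {"name": "Tablet"},
}

def keys_needed_by_a():
    """Step 1 (on A): collect only the distinct join keys."""
    return {order["product_id"] for order in ORDERS_ON_A}

def lookup_on_b(product_ids):
    """Step 2 (on B): return only the product rows that will match."""
    return {pid: PRODUCTS_ON_B[pid] for pid in product_ids if pid in PRODUCTS_ON_B}

def join_on_a(shipped_products):
    """Step 3 (on A): finish the join locally with the shipped rows."""
    return [
        {**order, "name": shipped_products[order["product_id"]]["name"]}
        for order in ORDERS_ON_A
        if order["product_id"] in shipped_products
    ]

shipped = lookup_on_b(keys_needed_by_a())
joined = join_on_a(shipped)
print(len(shipped), "of", len(PRODUCTS_ON_B), "product rows crossed the network")
```

Here only the two product rows that appear in orders are shipped, rather than the whole catalog.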

Example

Consider a global retail chain with a distributed database containing data across multiple servers:

Server 1: Holds sales data for North America.
Server 2: Contains sales data for Europe.

Query:

"Find the total sales amount for electronics in July 2023."

Applying the Techniques:

Query Fragmentation: Split the query into two parts: retrieve July electronics sales from North America and Europe separately.
Push-Down Predicates: Filter for "electronics" and "July 2023" as early as possible for each data shard.
Local Joins: If sales rows must be joined with product categories, perform that join on the server that already holds the sales rows.
Data Sharding: The data is already organized by regions, which minimizes cross-region data transfers.
Caching: Store recent sales summaries to swiftly respond to similar future queries.

With this approach, each server returns only its filtered partial total, so little data crosses the network, the regions are processed in parallel, and the final step is simply adding the partial totals together.