In an era where data flows in like a flood—tweets, transactions, telemetry, you name it—systems need to be not just fast, but also scalable, fault-tolerant, and smart. This blog dives deep into the foundational elements of distributed data processing: Distributed File Systems, HDFS, MapReduce, and Count-Min Sketch (CMS). We'll explore how they work together and even design a Twitter-style system that tracks trending hashtags in real time.
Distributed File Systems: The Backbone of Big Data
A Distributed File System (DFS) allows files to be stored across multiple servers while still appearing as a single file system to the user. Internally, a DFS divides files into fixed-size blocks (e.g., 128MB), spreads them across DataNodes, and manages metadata via a NameNode. Clients first contact the NameNode for metadata and then communicate directly with DataNodes to read or write blocks. This parallel access boosts I/O throughput.
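To make that two-step lookup concrete, here is a toy sketch of the metadata a DFS keeps; the ToyNameNode class and its fields are illustrative placeholders, not a real DFS API:
import java.util.List;
import java.util.Map;

// Toy model of DFS metadata: a file is a list of block IDs, and each
// block ID maps to the DataNodes holding a replica of that block.
public class ToyNameNode {
    private final Map<String, List<String>> fileToBlocks;      // "/logs/day1" -> ["blk_1", "blk_2"]
    private final Map<String, List<String>> blockToDataNodes;  // "blk_1" -> ["dn3", "dn7", "dn9"]

    public ToyNameNode(Map<String, List<String>> fileToBlocks,
                       Map<String, List<String>> blockToDataNodes) {
        this.fileToBlocks = fileToBlocks;
        this.blockToDataNodes = blockToDataNodes;
    }

    // Step 1 of a read: the client asks only for metadata.
    public List<String> blocksOf(String path) {
        return fileToBlocks.get(path);
    }

    // Step 2 happens elsewhere: the client streams each block
    // directly from one of these DataNodes, in parallel.
    public List<String> locationsOf(String blockId) {
        return blockToDataNodes.get(blockId);
    }
}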
DFS is ideal for:
- Web-scale applications
- Data lakes
- Distributed backup
- Scientific and analytics workloads
HDFS: How It Works in Detail
Each file is divided into 128MB blocks by default, and each block is stored as multiple replicas (usually 3). The NameNode stores only metadata: block locations, the file hierarchy, and replication status. The actual file data lives on the DataNodes.
When a client writes a file:
- It asks the NameNode for permission to write the file.
- The NameNode responds with a list of DataNodes for replication.
- The client sends the block to the first DataNode, which forwards it to the second, which forwards to the third — forming a write pipeline.
Rack awareness ensures that replicas land on different racks, so losing a single rack doesn't lose every copy of a block. Clients read by contacting the NameNode for block locations and then pulling data directly from the nearest replica.
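As a concrete illustration of that client flow, a write and a read through the HDFS Java API look roughly like this; the NameNode URI and paths are placeholders for your own cluster, and error handling is omitted:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI: point this at your own NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write: the client gets a DataNode pipeline from the NameNode
        // and streams the data to the first node, which forwards it on.
        try (FSDataOutputStream out = fs.create(new Path("/user/you/hello.txt"))) {
            out.writeUTF("hello hdfs");
        }

        // Read: the client gets block locations from the NameNode
        // and pulls the bytes directly from the nearest replica.
        try (FSDataInputStream in = fs.open(new Path("/user/you/hello.txt"))) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}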
MapReduce: Internals, Cluster Behavior, and Example
MapReduce splits data processing into two parallelizable stages, map and reduce, connected by a shuffle.
When you run:
hadoop jar wordcount.jar WordCount /user/you/input /user/you/output
Assume the input is a 1GB file stored as 8 HDFS blocks (128MB each). 8 map tasks are launched on the nodes where those blocks reside (data locality). Each map task processes its lines and emits (word, 1) pairs; this intermediate output goes to local disk, not HDFS.
The shuffle phase then groups all pairs with the same key (e.g., every ("hello", 1)) and routes them to a single reducer. Each reducer sums the values for its keys and writes sorted final output to HDFS.
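For example, if one map task reads the line "hello world hello", it emits (hello, 1), (world, 1), (hello, 1). The shuffle groups these into hello → [1, 1] and world → [1], and the reducers then write hello 2 and world 1 to the output files.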
Word Count Java Code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken().toLowerCase());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all the 1s emitted for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
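The class above omits the job driver. Here is a minimal sketch of a main() that could be added to WordCount (so the command shown earlier works as written), assuming the standard Hadoop job-setup API; the extra imports go at the top of the file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional: pre-sums counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/you/input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/you/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The combiner is optional, but it cuts shuffle traffic by summing counts locally on each map node before they ever cross the network.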
Explanation Line by Line:
- TokenizerMapper: breaks each line of input into words.
- map() method: emits (word, 1) for every word in the line.
- IntSumReducer: receives each word with a list of 1s and sums them up.
- context.write(): outputs the final count for each word.
Count-Min Sketch (CMS): Internals, Code, and Use Cases
CMS is a compact, probabilistic data structure used for frequency counting in data streams. It offers:
- Constant-time updates and queries
- Fixed-size memory usage
- May overestimate a count, but never underestimates it
It is excellent for:
- Real-time analytics
- Network monitoring
- Spam detection
- Recommendation systems
How CMS Works:
- Uses a 2D array: [depth][width]
- Each row uses a different hash function
- add(item): hashes the item once per row and increments that row's counter
- estimate(item): takes the minimum value among all rows' hash buckets
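How far off estimate(item) can be is governed by those table dimensions. A commonly quoted sizing rule for CMS is width = ceil(e / ε) and depth = ceil(ln(1 / δ)), which keeps the overestimate below ε times the total stream size with probability at least 1 − δ. A quick sketch of that calculation (the class name is just for illustration):
public class CmsSizing {
    // Standard Count-Min Sketch sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
    public static int widthFor(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    public static int depthFor(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }

    public static void main(String[] args) {
        // Example: error at most 0.1% of the stream size, with 99% confidence.
        System.out.println(widthFor(0.001)); // 2719 columns
        System.out.println(depthFor(0.01));  // 5 rows
    }
}
With those numbers, roughly 2719 × 5 integer counters (about 54KB) track frequencies within 0.1% of a stream of any length, which is the whole appeal of the sketch.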
Detailed Java Implementation
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.CRC32;

public class CountMinSketch {
    private final int[][] table;
    private final int depth;
    private final int width;
    private final int[] seeds;

    public CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.table = new int[depth][width];
        this.seeds = new int[depth];
        Random rand = new Random();
        for (int i = 0; i < depth; i++) {
            seeds[i] = rand.nextInt(); // one seed per row -> one hash function per row
        }
    }

    // CRC32 of the item, mixed with the row's seed, mapped into [0, width).
    private int hash(String item, int seed) {
        CRC32 crc = new CRC32();
        crc.update(item.getBytes(StandardCharsets.UTF_8));
        long baseHash = crc.getValue() ^ seed;
        return (int) (Math.abs(baseHash) % width);
    }

    // Increment one counter per row.
    public void add(String item) {
        for (int i = 0; i < depth; i++) {
            int col = hash(item, seeds[i]);
            table[i][col]++;
        }
    }

    // The true count is at most the smallest counter across rows.
    public int estimate(String item) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            int col = hash(item, seeds[i]);
            min = Math.min(min, table[i][col]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(10, 3);
        String[] stream = {"cat", "dog", "cat", "bird", "dog", "cat"};
        for (String item : stream) {
            cms.add(item);
        }
        System.out.println("cat: " + cms.estimate("cat"));   // 3 (never less, possibly more)
        System.out.println("dog: " + cms.estimate("dog"));   // 2 (never less, possibly more)
        System.out.println("bird: " + cms.estimate("bird")); // 1 (never less, possibly more)
        System.out.println("fox: " + cms.estimate("fox"));   // 0 unless it collides with another item's buckets
    }
}
Line-by-line explanation:
- seeds: ensure a different hash function for each row.
- hash(): combines CRC32 with a seed to produce consistent, varied hash values.
- add(): increments one counter in each row.
- estimate(): returns the minimum count across all rows for a given item.
DFS stores data. HDFS coordinates blocks. MapReduce crunches massive batches. CMS handles fast, on-the-fly frequency estimates. Together, they empower scalable analytics systems used at internet-scale companies.
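To close the loop on the trending-hashtags idea from the intro, here is a rough sketch of how the CountMinSketch class above could sit behind a tweet stream; the hard-coded tweets stand in for whatever queue or firehose would feed a real system:
public class TrendingHashtags {
    public static void main(String[] args) {
        // Wider table than the demo above, since a real hashtag stream has many distinct keys.
        CountMinSketch cms = new CountMinSketch(10_000, 5);

        // Placeholder stream: in a real system these would arrive from a message queue.
        String[] tweets = {
            "Big game tonight #WorldCup #Football",
            "Watching the match #WorldCup",
            "New phone launch #Tech"
        };

        for (String tweet : tweets) {
            for (String token : tweet.split("\\s+")) {
                if (token.startsWith("#")) {
                    cms.add(token.toLowerCase()); // count every hashtag occurrence
                }
            }
        }

        System.out.println("#worldcup: " + cms.estimate("#worldcup")); // 2 (never less, possibly more)
        System.out.println("#tech: " + cms.estimate("#tech"));         // 1 (never less, possibly more)
    }
}
A sketch alone only answers point queries ("how often did #worldcup appear?"); a production tracker would pair it with a small heap of candidate hashtags to maintain an actual top-K trending list.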