Dmitry Romanoff

Posted on Jul 12

Randomized Data Sampling and Aggregation in MongoDB

#mongodb #database #coding #devops

As developers, we often work with large datasets. MongoDB, a popular NoSQL database, offers aggregation and sampling features that make this work manageable. This article walks through MongoDB shell commands for random sampling, aggregation, and sorting, and shows how to use them to generate, query, and analyze data.

1. Inserting Randomized Data

Let's start by inserting a set of random books into a books collection. We'll simulate a library database, where each book has an author, title, genre, format, page count, and year of publication.

// Choose the database
use mylibrary;

// Define helper functions (basic randomizer)
function getRandomElement(arr) {
  return arr[Math.floor(Math.random() * arr.length)];
}

function getRandomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

var genres = ['Fiction', 'Non-Fiction', 'Science', 'Fantasy', 'Biography', 'History', 'Mystery', 'Romance'];
var formats = ['epub', 'pdf', 'txt', 'audio'];
var firstNames = ['John', 'Mary', 'Alice', 'Robert', 'Linda', 'Michael', 'Sarah', 'David'];
var lastNames = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia'];

for (var i = 0; i < 1000; i++) {
  var author = getRandomElement(firstNames) + " " + getRandomElement(lastNames);
  var title = "Book Title " + (i + 1);
  var book = {
    author: author,
    title: title,
    genre: getRandomElement(genres),
    format: getRandomElement(formats),
    num_of_pages: getRandomInt(50, 1000),
    year_when_published: getRandomInt(1900, 2025)
  };

  db.books.insertOne(book);
}

In this script, we populate a MongoDB collection called books with 1000 randomly generated entries. Each book's data is randomized for the author name, genre, format, number of pages, and publication year.
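Calling insertOne() inside the loop sends 1000 separate writes. A minimal sketch of a batched variant follows: it builds the documents in memory and sends them with a single insertMany() call. The helper name buildBooks and its injectable rand parameter are assumptions added for illustration, and the typeof db guard is only there so the snippet can also run outside mongosh.

```javascript
// Hypothetical helper: build `count` book documents in memory.
// `rand` defaults to Math.random and can be swapped for a fixed function in tests.
function buildBooks(count, rand = Math.random) {
  const genres = ['Fiction', 'Non-Fiction', 'Science', 'Fantasy', 'Biography', 'History', 'Mystery', 'Romance'];
  const formats = ['epub', 'pdf', 'txt', 'audio'];
  const firstNames = ['John', 'Mary', 'Alice', 'Robert', 'Linda', 'Michael', 'Sarah', 'David'];
  const lastNames = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Miller', 'Davis', 'Garcia'];

  // Same randomizers as in the script above, expressed in terms of `rand`.
  const pick = (arr) => arr[Math.floor(rand() * arr.length)];
  const intBetween = (min, max) => Math.floor(rand() * (max - min + 1)) + min; // inclusive bounds

  const books = [];
  for (let i = 0; i < count; i++) {
    books.push({
      author: pick(firstNames) + " " + pick(lastNames),
      title: "Book Title " + (i + 1),
      genre: pick(genres),
      format: pick(formats),
      num_of_pages: intBetween(50, 1000),
      year_when_published: intBetween(1900, 2025)
    });
  }
  return books;
}

const generatedBooks = buildBooks(1000);

// Only touch the database when running inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  db.books.insertMany(generatedBooks);
}
```

Building the array first also makes it easy to inspect a few generated documents before anything is written.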

2. Counting Documents in a Collection

Once the data is inserted, we may want to know how many documents exist in the books collection. MongoDB provides the countDocuments() method for this; the older count() method is deprecated in current shells and drivers.

db.books.countDocuments()

This command will return the total count of documents in the books collection. It’s a simple way to verify that 1000 books were inserted successfully.

3. Sampling Random Books

What if you want to randomly sample a set of books? The $sample stage in the aggregation pipeline is perfect for this.

db.books.aggregate([
  { $sample: { size: 10 } },        // Randomly sample 10 documents (1% of 1000)
  { $sort: { author: 1 } }          // Sort the sampled documents by author (ascending)
])

In this example, we're randomly selecting 10 books and sorting them by author in ascending order. The $sample stage allows us to work with a random subset of the data, which can be useful for testing, statistics, or when you want to generate random previews.
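To make the sample-then-sort idea concrete, here is an in-memory analogy in plain JavaScript. It is only a sketch: the server implements $sample differently, and the helper name pickRandom and the demo data are assumptions for illustration. This sketch always returns distinct items, which is a property of the sketch rather than a guarantee taken from $sample.

```javascript
// In-memory analogy only, not MongoDB's implementation of $sample.
// Picks up to `size` distinct items using a partial Fisher-Yates shuffle.
function pickRandom(items, size, rand = Math.random) {
  const pool = items.slice();                 // copy so the input is not mutated
  const n = Math.min(size, pool.length);
  for (let i = 0; i < n; i++) {
    // Choose a random index in [i, pool.length) and swap it into position i.
    const j = i + Math.floor(rand() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

const demoBooks = [
  { author: "Mary Smith", title: "Book Title 1" },
  { author: "John Brown", title: "Book Title 2" },
  { author: "Alice Jones", title: "Book Title 3" },
  { author: "David Garcia", title: "Book Title 4" },
  { author: "Linda Miller", title: "Book Title 5" },
  { author: "Sarah Davis", title: "Book Title 6" }
];

// Sample first, then sort only the sampled items by author (ascending).
const demoSample = pickRandom(demoBooks, 3)
  .sort((a, b) => (a.author < b.author ? -1 : a.author > b.author ? 1 : 0));
```

The order of the two steps matters: sorting after sampling orders only the 10 chosen documents, which is much cheaper than sorting the whole collection.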

4. Randomly Skipping and Limiting Results

If you need to fetch a single random document, you can use a combination of $skip and $limit. Here's how:

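db.books.aggregate([
  {
    // $skip expects a whole number, so round the random offset down
    $skip: Math.floor(db.books.countDocuments() * Math.random())
  },
  {
    $limit: 1
  }
]);

This query skips a random number of documents, anywhere from 0 to one less than the total count, and then returns the next document. Two caveats apply. The pipeline has no $sort stage, so the documents are skipped in whatever order the server happens to return them. The server also has to walk past every skipped document, so the query gets slower as the offset grows. For most cases, { $sample: { size: 1 } } is the simpler way to fetch one random document.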
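The offset calculation can be pulled into a small helper so it can be checked on its own. This is a minimal sketch: randomSkipOffset and randomDocPipeline are hypothetical names, and the helper rounds down because $skip expects a whole number. The typeof db guard lets the snippet run outside mongosh as well.

```javascript
// Hypothetical helper: pick an integer offset in [0, total - 1] for $skip.
function randomSkipOffset(total, rand = Math.random) {
  if (total <= 0) return 0;                 // empty collection: nothing to skip
  return Math.floor(rand() * total);        // Math.random() is always < 1, so the result is < total
}

// Hypothetical helper: build the skip/limit pipeline for a given collection size.
function randomDocPipeline(total, rand = Math.random) {
  return [
    { $skip: randomSkipOffset(total, rand) },
    { $limit: 1 }
  ];
}

// Run the query only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const totalBooks = db.books.countDocuments();
  printjson(db.books.aggregate(randomDocPipeline(totalBooks)).toArray());
}
```

Keeping the offset strictly below the total means the pipeline always has at least one document left to return when the collection is not empty.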

5. Aggregating Data by Format

To analyze how many books exist in each format (e.g., pdf, epub, audio), you can use the $group stage to aggregate and the $sort stage to order the results.

db.books.aggregate([
  { $sample: { size: 50 } },               // Take a 5% sample (assuming ~1000 docs)
  { $group: { 
      _id: "$format", 
      count: { $sum: 1 } 
    } 
  },
  { $sort: { count: -1 } }                 // Sort by most common format
])

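This query randomly samples 50 books and then groups them by format. The result shows the most common formats in the sample, sorted in descending order of frequency, so the counts are an estimate of the collection's distribution rather than exact totals. Because our generator picks formats uniformly at random, expect the four counts to be close to each other and to change from run to run.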
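What the $group and $sort stages compute can be mirrored in a few lines of plain JavaScript. This is an in-memory analogy under stated assumptions, not how the server executes the pipeline; the helper name countByField and the six demo documents are made up for illustration.

```javascript
// In-memory analogy of:
//   { $group: { _id: "$<field>", count: { $sum: 1 } } }, { $sort: { count: -1 } }
function countByField(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    // Add 1 to the running count for this document's field value.
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  // Convert to the { _id, count } shape that $group produces, most frequent first.
  return Array.from(counts, ([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const formatDemo = [
  { format: "pdf" }, { format: "epub" }, { format: "pdf" },
  { format: "audio" }, { format: "pdf" }, { format: "epub" }
];

const formatCounts = countByField(formatDemo, "format");
// pdf occurs 3 times, epub 2 times, audio once
```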

6. Aggregating Data by Genre

Similar to the format-based aggregation, we can also aggregate books based on their genre to determine which genres are most popular.

db.books.aggregate([
  { $sample: { size: 100 } },                // Take a 10% sample (assuming ~1000 docs)
  { $group: { 
      _id: "$genre", 
      count: { $sum: 1 }
    } 
  },
  { $sort: { count: -1 } }                   // Sort by most common genre
])

This aggregation pipeline randomly samples 100 books, groups them by genre, and counts the occurrences of each genre. The results show the most common genres in the sample, which approximates the distribution in the full collection.
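Since the format and genre pipelines differ only in the field name and sample size, they can be generated by one function. The sketch below assumes a hypothetical helper, countByFieldPipeline, that is not part of MongoDB; leaving out the sample size drops the $sample stage and yields exact counts over the whole collection.

```javascript
// Hypothetical helper: build a "count by field" pipeline.
// Pass a positive integer sampleSize to estimate from a random subset,
// or omit it to count over every document.
function countByFieldPipeline(field, sampleSize) {
  const stages = [];
  if (Number.isInteger(sampleSize) && sampleSize > 0) {
    stages.push({ $sample: { size: sampleSize } });
  }
  stages.push({ $group: { _id: "$" + field, count: { $sum: 1 } } });
  stages.push({ $sort: { count: -1 } });
  return stages;
}

const sampledGenrePipeline = countByFieldPipeline("genre", 100); // estimate from 100 docs
const exactGenrePipeline = countByFieldPipeline("genre");        // exact, no $sample stage

// Run the exact version only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.books.aggregate(exactGenrePipeline).toArray());
}
```

Comparing the sampled and exact results side by side is a quick way to see how far a 10% sample can drift from the true counts.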

7. Dynamic Sampling Based on Document Count

In some cases, you may want to sample a dynamic percentage of documents, say 10% of the total documents. You can calculate the sample size dynamically as shown here:

var totalCount = db.books.countDocuments();

// Calculate 10% of the total (you can change 0.1 to any fraction you need)
var sampleSize = Math.floor(totalCount * 0.1);

// Perform the aggregation with the calculated sample size
db.books.aggregate([
  { $sample: { size: sampleSize } },        // Randomly sample 10% of the documents
  { $sort: { author: 1 } }                  // Sort by author (ascending)
])

In this query, we dynamically calculate 10% of the total document count and then randomly sample that many books. This allows for flexible sampling based on the collection size.
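One edge case is worth guarding against: on a small collection, Math.floor(totalCount * 0.1) can come out as 0, and $sample expects a positive size. A minimal sketch of a clamped calculation follows; the helper name sampleSizeFor is an assumption for illustration, and the typeof db guard lets the snippet run outside mongosh.

```javascript
// Hypothetical helper: turn a fraction of the collection into a usable $sample size.
// The result is at least 1 and never larger than the collection itself.
function sampleSizeFor(totalCount, fraction) {
  if (totalCount <= 0) return 0;            // nothing to sample; the caller should skip the query
  const raw = Math.floor(totalCount * fraction);
  return Math.min(totalCount, Math.max(1, raw));
}

// Run the query only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  const dynamicSize = sampleSizeFor(db.books.countDocuments(), 0.1);
  if (dynamicSize > 0) {
    printjson(db.books.aggregate([
      { $sample: { size: dynamicSize } },
      { $sort: { author: 1 } }
    ]).toArray());
  }
}
```

For the 1000-book collection this gives 100, the same value as the calculation above; for a 5-document collection it gives 1 instead of 0.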

8. Combining Genre and Format Aggregation

Finally, let’s combine both genre and format in our aggregation to understand the most popular format within each genre.

db.books.aggregate([
  { $sample: { size: 100 } },  // Sample 10% (assuming ~1000 docs)
  { 
    $group: {
      _id: { genre: "$genre", format: "$format" },  // Group by genre and format
      count: { $sum: 1 }                           // Count how many books in each format within a genre
    }
  },
  { 
    $sort: { "count": -1 }  // Sort by highest count (most popular format per genre)
  },
  {
    $group: {
      _id: "$_id.genre",                  // Group by genre
      mostPopularFormat: { $first: "$_id.format" },  // Get the most popular format per genre
      count: { $first: "$count" }         // Get the count of that most popular format
    }
  },
  { 
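    $sort: { "count": -1 }  // Sort genres by the count of their top format
  }
])

This aggregation finds the most popular format within each genre of the sample. Note that the final $sort orders genres by how many sampled books use their top format, which is not the same as the genre's overall popularity. With only 100 sampled books spread across 32 genre and format combinations, the counts are small and ties are common, so the results will vary from run to run.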
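If the goal is to rank genres by their overall size, the second $group can also sum the per-format counts. The sketch below is one possible variant, not the only way to do it: the field names topFormatCount and genreTotal are assumptions introduced here, and the $sample stage is left out so the counts cover the whole collection.

```javascript
// Variant pipeline: top format per genre, plus the total number of books per genre.
const genreFormatPipeline = [
  {
    $group: {
      _id: { genre: "$genre", format: "$format" },   // one group per genre/format pair
      count: { $sum: 1 }
    }
  },
  { $sort: { count: -1 } },                          // largest pairs first
  {
    $group: {
      _id: "$_id.genre",
      mostPopularFormat: { $first: "$_id.format" },  // first pair seen per genre is its largest
      topFormatCount: { $first: "$count" },          // hypothetical field: count of that format
      genreTotal: { $sum: "$count" }                 // hypothetical field: all books in the genre
    }
  },
  { $sort: { genreTotal: -1 } }                      // rank genres by total books
];

// Run the query only inside mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.books.aggregate(genreFormatPipeline).toArray());
}
```

Each output document then carries both numbers, so the top format's share of its genre can be read off as topFormatCount divided by genreTotal.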

Conclusion

In this article, we’ve explored a variety of MongoDB commands for sampling and aggregating data. Whether you’re generating random data, sorting by author, or analyzing formats and genres, these MongoDB aggregation commands provide powerful tools for data analysis. As you work with larger datasets, these techniques will help you extract meaningful insights efficiently.

MongoDB’s aggregation framework is highly flexible, and by combining stages like $sample, $group, $sort, and $skip, you can solve a wide range of problems in data processing.

DEV Community