Bloom Filter

Yao — Fri, 29 Sep 2023 10:30:26 +0000

Abstract

This is my 2nd post. In this post, I'd like to record some of the key learnings from a video (see bottom) about the Bloom filter, a popular technique for checking whether a collection contains a specific target by its key.

From wikipedia: A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.

Key Learnings

Here are some of the key learnings:

Bloom filters are a probabilistic data structure that can give false positives but not false negatives. In other words, when checking existence, a filter may say Yes when the truth is No, but it will never say No when the truth is Yes.
A Bloom filter works by marking positions in a fixed-size array for each key. The input key is run through one or more hash functions, and each hash value is reduced modulo the array length so that it fits in the underlying array. Checking whether a key exists is an O(1) operation, because the same indexes can be recomputed from the key in the same way.
A collision, which is the source of false positives, occurs when a key that was never inserted hashes to positions that were already set by other inserted items.
A Bloom filter is space efficient for several reasons.
- It does not store the entire object, or even the key itself; it only marks the positions derived from the key.
- Each key is mapped to indexes in an array whose size is fixed and commonly much smaller than the original dataset.
- A byte buffer packs the flags more tightly. Instead of a Boolean[] with one entry per slot, we can use a Byte[] in which each byte represents 8 slots. With bit manipulation, this cuts space usage by a factor of 8, assuming each Boolean entry occupies one byte.
Using multiple hash functions reduces false positives. Several hash functions produce a set of bits that must all be set on insert and all be checked on lookup, so a key that was never inserted is less likely to match by accident, as long as the array is not too full.
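
To make the points above concrete, here is a minimal sketch in Python. It is not the code from the video. The class name `BloomFilter`, the method names, and the choice of deriving the hash functions from a single SHA-256 digest (a double hashing trick) are all my own assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch backed by a bytearray (8 slots per byte)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte holds 8 slots, so round up to the next whole byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Assumption: simulate num_hashes hash functions by combining two
        # 64-bit values taken from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            # The modulo maps each hash value into the underlying array.
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # True means "possibly in set"; False means "definitely not in set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("alice")
print(bf.might_contain("alice"))  # True, because there are no false negatives
print(bf.might_contain("bob"))    # usually False, but a false positive is possible
```

Here 1024 slots take only 128 bytes, which is the factor-of-8 saving from the byte buffer.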

:P About 50% of the above summary was generated with a GPT-backed tool. The only thing I needed to do was feed it the YouTube link.

When to use Bloom Filter

I've seen Spark leverage Bloom filters to speed up joins between tables as well as filters on a single table, both of which are quite common scenarios.

With a Bloom filter, one can answer the question of whether a key possibly exists in a collection. Assume a 1% false positive rate. One can then eliminate about 99% of the candidates that are not in the collection up front, after creating and possibly transferring only a very small amount of data (the Bloom filter itself). In practice, the false positive rate can be made much smaller.
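
To get a feel for how small "very small" is, the sketch below applies the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n expected items and a target false positive rate p. The function name `bloom_parameters` and the example of one million keys are my own assumptions, not figures from the video or from Spark.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (num_bits, num_hashes) from the standard sizing formulas."""
    if expected_items <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need expected_items > 0 and 0 < false_positive_rate < 1")
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit.
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to the nearest integer and at least 1.
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


# Hypothetical workload: one million keys at a 1% false positive rate.
num_bits, num_hashes = bloom_parameters(1_000_000, 0.01)
size_mib = num_bits / 8 / 1024 / 1024
print(f"{size_mib:.2f} MiB, {num_hashes} hash functions")
```

For this hypothetical workload the filter comes out at roughly 1.1 MiB with 7 hash functions, which is under 10 bits per key regardless of how large the keys themselves are.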

Credit to the YouTube video: https://www.youtube.com/watch?v=UVFnabieyzc&ab_channel=AsliEngineeringbyArpitBhayani

What You Should Do as A Senior Software Engineer - Retrospective 2023

Yao — Fri, 29 Sep 2023 10:29:07 +0000

In 2022, I was promoted to Senior SDE at one of the FAANG companies. Along the journey, I've received a lot of help from my senior co-workers and leadership. Very often, I was doing something that a senior engineer should not do. When that happened, their guidance helped me get back on the right track. I feel it's time to do a retrospective to remind myself periodically what expectations people hold of senior software engineers.

Role and Responsibility

To summarize everything in one sentence, a senior engineer is responsible for making the right technical decisions within a one pizza team.

As the baseline, you will be responsible for the system or architecture designs of the most complex projects in the team. The duration of a project could vary from a few months to a few years. You are expected to make the initial design and finalize it with relevant stakeholders, including the business team, senior leadership, principal engineers, etc.

At the same time, your work will not be limited to your own tasks. Your junior teammates need your help. They will frequently seek your input on various topics including system designs, on-call issues, best practices, or even career advice.

Last but not least, for all collaborations with other teams, you are, by default, the go-to person for any technical questions.

In some companies, you may also assume part of the SDM role and draft the engineering plan for most of the projects that your team carries.

Collaboration, Communication, Writing

A senior engineer is always expected to be accurate and precise. Their words are meant to be useful and easy to understand. (Yes, I am still far from there 😄)

Accuracy

As a senior engineer who is accountable for every technical thing in the team, your words are taken seriously by others. The first thing you'd like to avoid is repeatedly using non-deterministic words such as "probably", "might", and "perhaps", which add uncertainty and reduce accuracy in the conversation. One senior leader I previously worked with had this line in their day-1 intro: "I will not use probably when I talk. Even if I do, there's 80%+ chance that the truth is what I say". To maintain high accuracy, senior engineers only share content from verified sources. If things aren't clear, they will call out that verification is needed.

Precision

This is something that many folks do not pay enough attention to. Amazon has a well-known culture of writing. Depending on the scope, they require that a doc fit into either 1 page or 6 pages (check it out). Under this rule, most low-level technical docs fit into one page, while most high-level technical or business docs may need 6 pages. Why limit the number of pages? In short, it helps both the author and the readers focus on the most important things, and makes both reading and writing more efficient. Here's a real example of mine. I once wrote a 3-page doc to explain one complex technical issue. In the doc, I explained how the overall system works, what the problem is in detail, and how I felt things should be changed. My skip-manager took it over and spent a few days rewriting it into one page. While being amazed by how precise the new doc was, I noticed a few key changes:

Remove irrelevant things from the context. People often try to put a lot of things into the context section that they feel other folks should also pay attention to. However, many of those things are either not relevant or already known to the readers. For instance, there is no need to explain how Apache Spark parses a SQL query when you are proposing a perf optimization idea to the Spark team.
Let the data talk. People often use a large number of words to explain or describe an issue. That is neither straightforward nor convincing. Instead, a good doc uses data to present the facts. With data, it becomes easy to guide the readers in the same direction.
Have a clear structure and goal. One critical thing I've learned is that writing a doc is just like telling a story or having a conversation with someone. You need to be crystal clear on what the expected outcome is. Are you trying to propose a business idea? Giving a solution to a technical issue? Or even doing a KT session? Setting a goal helps you remain focused and avoid deviating or going back and forth. After that, you should create a clear structure for the conversation so that the audience can follow your lead to the conclusion that you want to draw.

Execution

Senior engineers hold full responsibility for delivery. They ensure that everyone in the team remains on track to deliver high-quality products before the deadline. It often means three things:

Frequently review the current progress against the timeline.
Help others get unblocked if they are blocked by technical issues or external dependencies.
Periodically report progress and upcoming work to key stakeholders.

In the above, I've shared my view on some of the key expectations from senior engineers. These expectations are not just one-time things but rather should be constantly followed.

As this is my first post on DEV, I hope this post marks a new start for me. When we get promoted to senior engineer, we've not reached a terminal but rather a new gate for take-off.

"Stay Humble and Stay foolish."

DEV Community: Yao