💼 I’m currently looking for a Data Engineer position that offers visa sponsorship. If you find this article inspiring and are willing to help, please contact me on my LinkedIn profile 🔗Foster Guo, or refer to my online resume Foster Guo - Data Engineer to find my email or phone number.
Apple is my nickname, because my Chinese name is ‘富士’, which can refer to Fuji Apple, Mount Fuji, Fujitsu, FujiFilm, etc.
🍎 General, Elegant, Efficient —— Designed and developed by Apple
Basically, it’s a framework designed to assist content teams in retrieving data independently, without relying on data analysts or engineers.
Train of thought
📌 Apple said, let there be a platform that implements strategy recall, and there was a strategy recall platform.
Origin
Background
Content platforms like TikTok, Twitter, and Facebook often encounter several challenges:
- Keyword Search: How can you efficiently find specific comments under a particular post using keywords?
- User-specific Comments: How can you locate comments made by a specific user under a specific post?
- Complex Data Retrieval: How can you retrieve data using complex strategies without relying on data analysts?
Content operations and content safety teams face these issues daily.
So, What’s the Solution?
The solution is straightforward:
- Freeloader: Describe your needs to a data engineer, who writes a SQL query to retrieve the data for you.
No, no, no. No way. I don’t want to become a SQL machine. I want an automated solution where you retrieve the data yourself. If you’re not familiar with SQL, don’t worry. Let’s design it together.
So the practical solutions are:
- Abstract Features: Abstract features related to different entities.
- Combine and Search: Integrate and search these features effectively.
More specifically:
- Complete Data Access: Ensure you have access to complete content data to avoid missing crucial information.
- Target Retrieval Capabilities: Implement methods to retrieve specific targets from large datasets, including word matching, image/video/audio similarity matching, and model threshold matching.
- Entity Combination: Enable the ability to combine different entities (such as `user`, `post`, `comment`, etc.) for more precise searches.
- Speed: Ensure that the system operates as quickly as possible.
Abstraction
- Near Real-Time Data
  - The volume of data is immense, making it impractical to calculate features on a case-by-case basis.
  - Implement a system where data is added to storage daily or even hourly to keep up with the influx of information.
- Entities
  - Within the same product or business, different entities can be combined using one or multiple keys. For example, use `post_id` to link `post` and `comment`, or `user_id` to connect `user`, `post`, and `comment`.
  - Each entity carries different features that can be sourced from a Data Warehouse.
- Feature → Rule → Strategy
  - Feature = Characteristics of an entity
    - Main Features: Defined by users or sourced from the data warehouse. Examples include `title`, `content`, `OCR text`, `ASR text`, `image`, `video`, etc. These are static and serve as inputs for derived features.
    - Behavior Features: Defined by user interactions and sourced from the data warehouse. Examples include `view_count`, `comment_count`, `like_count`, etc. These are dynamic and are used when constructing rules.
    - Environment Features: Defined by physical devices and sourced from the data warehouse. Examples include `location`, `post_time`, `device_identity`, etc. These are static and are used in rule construction.
    - Derived Features: Defined and calculated by the platform using various methods. Examples include `title matches word1`, `image1 is similar to image2`, etc. These are dynamic and used in rules.
  - Rule = Feature + Match + Value
    - Each feature has its own characteristics:
      - ID Feature: `post_id`, `comment_id`, `user_id`, etc.
      - Text Feature: `title`, `content`, `ASR`, `OCR`, `text`, `user_name`, `user_bio`, etc.
      - Metric Feature: `like_cnt`, `view_cnt`, etc.
      - Enumeration Feature: `is_promoted`, `city`, `risk_level`, etc.
      - Array Feature: `hit_word_id_list`, `tag_id_list`, etc.
      - Media Feature: `image`, `video`, etc.
    - Categorize each feature, assign it appropriate match methods, and collect reusable values.
    - Examples:
      - title matches wordlist 792
      - view_count ≥ 100
      - comment_id in (32174921883, 98327197428, 84921375721)
  - Strategy = {Entity’s [Logic + Rule]} ^ n
    - {post’s [title matches wordlist 792 AND view_count ≥ 100]}
    - {user’s [post_count > 10 OR comment_count > 1000]} ^ {comment’s [comment_text matches wordlist 659]}
📌 Features are the foundational elements that are reusable across different strategies.
Strategies constructed from these features can be applied consistently to various entities, enhancing efficiency and flexibility.
For example:
- `TikTok` and `Twitter` share the entity `Post`. `Post` has the feature `title`, and `title` has the rules `view_count_is_greater_than_100` and `title_matches_hello`. `title_matches_hello` forms `strategy_3`, so we can apply `strategy_3` on both `TikTok` and `Twitter`.
- `TikTok`, `Twitter`, and `YouTube` share the entities `Post` and `Comment`, whose features (`title`, `view_count`, `text`) are shared too. Therefore the rules (`text_not_matches_wordlist_659`, `title_matches_wordlist_792`) are shared, and finally `strategy_2` is shared.
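The Feature → Rule → Strategy hierarchy described above can be sketched as a small data model. This is a hypothetical illustration, not the platform's actual code; the class and function names (`Rule`, `EntityClause`, `Strategy`, `to_sql`, `wordlist_match`) are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical data model for Feature -> Rule -> Strategy.
# A Rule pairs a feature with a match method and a reusable value;
# an EntityClause combines one entity's rules with a logic operator;
# a Strategy chains entity clauses together ("^" in the article's notation).

@dataclass
class Rule:
    feature: str   # e.g. "title", "view_count"
    match: str     # e.g. "matches_wordlist", ">=", "in"
    value: object  # e.g. 792, 100, a tuple of ids

    def to_sql(self) -> str:
        if self.match == "matches_wordlist":
            # word matching would be a UDF in the real pipeline
            return f"wordlist_match({self.feature}, {self.value})"
        if self.match == "in":
            ids = ", ".join(str(v) for v in self.value)
            return f"{self.feature} in ({ids})"
        return f"{self.feature} {self.match} {self.value}"

@dataclass
class EntityClause:
    entity: str  # e.g. "post", "user", "comment"
    logic: str   # "AND" / "OR"
    rules: list

    def to_sql(self) -> str:
        joined = f" {self.logic} ".join(r.to_sql() for r in self.rules)
        return f"({joined})"

@dataclass
class Strategy:
    clauses: list  # one EntityClause per entity

# {post's [title matches wordlist 792 AND view_count >= 100]}
s = Strategy([EntityClause("post", "AND", [
    Rule("title", "matches_wordlist", 792),
    Rule("view_count", ">=", 100),
])])
print(s.clauses[0].to_sql())
# (wordlist_match(title, 792) AND view_count >= 100)
```

Because rules are plain data, the same `Rule` objects can be reused across strategies and products, which is exactly the reusability argument made above.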
Unified Solution
We aim to have an integrated platform that supports strategy-based searches and exhibits the following characteristics:
- Completeness: The data to be searched must be comprehensive. Any missing data could lead to serious consequences.
- Generality: Abstracted features and combined strategies must be preserved for quick reuse in other products.
- Accuracy: Building on completeness, the results retrieved must be precise and error-free.
- Real-Time: The platform must ensure efficient searches, ideally within half an hour.
In addition, based on the different characteristics of derived features, the platform needs:
- Wordlist Library: Manage word lists for different topics, including word list types, the word lists themselves, exemption word lists, and matching methods.
- Material Library: Manage materials for various topics, including defining material types and the materials themselves.
- Model Library: Manage models for different topics, including specifying the required features for each model and the methods for invoking these models.
This platform can save relevant features based on risk control instructions, establish recall strategies, and perform searches across all business data. We refer to this process as strategy recall and the platform as the strategy recall platform (SRP).
Present
Strategy Recall Platform
SRP contains seven modules: `word list management`, `material management`, `model management`, `strategy management`, `task management`, `feature management`, and `entity and product management`. It is
- General: Reusable word lists, materials, models, features, and strategies across different use cases.
- Elegant: Aesthetic design, elegant framework, and graceful implementation.
- Efficient: Efficient strategy construction and effective recall case handling.
Modules
- Word List Management: Add, update, or delete word lists; define the topic, type, and content of each word list.
- Material Management: Add, update, or delete materials; define material topics, descriptions, URLs, and other relevant details.
- Model Management: Managed by data scientists and data engineers, not designed for platform users.
- Strategy Management: Add, update, or delete strategies; define applicable products, topics, and the strategy itself.
- Task Management: Add, update, or delete tasks; select a strategy to trigger a task for data retrieval.
- Feature Management: Add, update, or delete features. Each feature has attributes and can be of type `string`, `date`, `int`, etc. Define applicable match methods, such as `word match`, `=`, `>=`, `in`, or `similar to`. Additionally, create composite features based on existing features, e.g., `concat(title, content)`.
- Entity and Product Management: Used by platform managers to oversee products, entities, and their connections, not intended for normal users.
Terminology
- Strategy Recall: Constructing strategies based on features to search through all business data.
- Strategy Recall Platform: The platform that implements strategy-based recall.
- Word List: Abstracted and standardized general business word lists, which can be categorized as:
  - Simple Words: Match exactly.
  - Text Deduplication: Calculate text similarity using edit distance.
  - Similar Characters: Regex matching for visually or phonetically similar words.
  - Acrostic Poems: Regex matching for acrostic poems.
  - Regex: General regex matching.
- Material: Abstracted and standardized general business materials, which can be categorized as images, audio, and video.
- Model: Abstracted and standardized general business models.
- Combined Strategies: Abstracted and standardized general business strategies that can be applied across different products.
- Basic Objects: Classified based on different business entities, including `posts`, `comments`, and `users`. For example, `shorts` can be treated as `posts` due to the similar business meaning.
- General Objects: Refers to business entities.
- Filtered Objects: Further classification of general objects. For example, `comments` can be divided into `comments` and `replies`; `users` can be divided into `post_users`, `comment_users`, and `reply_users`. A strategy can filter multiple filtered objects, such as `posts`, `comments`, and `replies`, simultaneously.
- Recall Objects: The targets of a strategy’s recall, selected from filtered objects. For instance, a strategy may filter `posts` and `comments` but recall only `comments`.
- Strategy Pattern: Combinations of filtered objects in a strategy, which can be extended. The validity of these combinations is defined by the platform.
User Story
- User: Content operations teams, analysts, and content safety teams.
- User Stories:
  - I have multiple word lists for different topics and want to know the number of posts related to each topic:
    - Use `word list management` to add the relevant word lists.
    - On the `strategy management` page, select the products, choose `post` as the filter object, apply the rule `title_description matches word lists`, and commit the task.
  - I have a video seed and want to find out how many positive comments are under similar videos:
    - Use `word list management` to add word lists related to positive topics.
    - Use `material management` to add the video seed.
    - On the `strategy management` page, select the products, choose `post` and `comment` as filter objects, and select `comment` as the recall object. For `post`, apply the rule `video is similar to material`. For `comment`, apply the rule `text matches word lists`, and commit the task.
  - I want to analyze the behavior of teenagers on videos with titles containing specific words from a word list:
    - Use `word list management` to add word lists related to a specific topic.
    - On the `strategy management` page, select the products, choose `post`, `comment`, and `comment_user` as filter objects, and select `comment` as the recall object. For `post`, apply the rule `title matches word lists`. For `comment`, apply no rules. For `comment_user`, apply the rule `user is a teenager`, and commit the task.
Procedure
Modules E-R
Filtered Object E-R
📌 Generally speaking, a strategy pattern is a subgraph of the filtered-object E-R diagram.
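One way to read this subgraph observation is as a validity check: a proposed combination of filtered objects is accepted only if it forms a connected subgraph of the E-R graph. The edge set and function below are illustrative stand-ins, not the platform's actual schema.

```python
# Hypothetical filtered-object E-R edges: which objects can join directly.
ER_EDGES = {
    ("post", "comment"), ("comment", "reply"),
    ("post", "post_user"), ("comment", "comment_user"),
}

def is_valid_pattern(objects):
    """A strategy pattern is valid if its filtered objects form a
    connected subgraph of the E-R graph (joinable in some order)."""
    objects = list(objects)
    reached = {objects[0]}
    changed = True
    while changed:  # simple fixed-point traversal over the edge set
        changed = False
        for a, b in ER_EDGES:
            if a in reached and b in objects and b not in reached:
                reached.add(b); changed = True
            if b in reached and a in objects and a not in reached:
                reached.add(a); changed = True
    return reached == set(objects)

assert is_valid_pattern(["post", "comment", "reply"])
assert not is_valid_pattern(["post", "reply"])  # reply only joins via comment
```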
Technical Details
📌 This section primarily discusses the backend implementation, especially the data pipelines. Although I have some knowledge of React and can write frontend code, I consider myself a rookie in that area.
Data access permission is controlled by an RBAC service, which I won’t discuss either.
Technology Selection
- Overview
  - Backend Service: Node.js for MongoDB CRUD.
  - Data Service: Python for task aggregation, SQL combination, task scheduling, and submitting Spark / Presto jobs.
  - Frontend: React.js + https://semi.design/zh-CN.
- Business Data Storage
  - Hudi MergeOnRead & Hive: Compatible with SQL, offering comprehensive metadata management and ease of calculating bucket numbers.
  - Parquet: After several benchmarks, Parquet was chosen; its columnar storage is efficient, optimizing data retrieval and storage.
- Online Database
  - MongoDB: Initially, I didn’t design the data structure optimally, leading to numerous changes. Despite this, MongoDB has proven to be a suitable choice due to its flexibility and performance.
- Language
  - JavaScript: MongoDB CRUD and frontend code.
  - Python 3: Used for PySpark code, chosen for its ease of use and integration with big data processing.
  - Java + Rust: Employed for high-performance word matching algorithms, balancing speed and efficiency.
Data Flow
Limitations
Real-time (T) data only has basic features and doesn’t support overly complex strategies, so we offer users a switch that determines whether a task can retrieve real-time data.
Combined Strategy Framework
📌 The strategy pattern involves combining filter objects. By defining join conditions between entities in the strategy pattern according to the join order, you can create an extensible framework for combined strategies.
```sql
----------------------------- Insert SQL Block -------------------------------
insert into table abcdef
select
  yy_.*
from
--------------------------- First Filter SQL Block ---------------------------
  (
    select xxx
    from xxx
    where xxx
  ) xx_
--------------------------- Second Filter SQL Block --------------------------
  join (
    select yyy
    from yyy
    where yyy
  ) yy_ on xx_.xxx = yy_.yyy
--------------------------- Third Filter SQL Block ---------------------------
  join (
    select zzz
    from zzz
    where zzz
  ) zz_ on yy_.yyy = zz_.zzz
```
```js
// for example
"post+all_comment+all_reply": {
  // post is the first filter sql block, it doesn't need a join condition.
  "all_comment": "all_comment_.post_id = post_.post_id",
  "all_reply": "all_reply_.post_id = all_comment_.post_id and (all_comment_.comment_id = all_reply_.parent_id or all_comment_.comment_id = all_reply_.sibling_id)"
}
```
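The join-condition mapping above can drive a small SQL builder: given a strategy pattern and one filter subquery per object, the blocks are chained in join order. This is a simplified sketch; the function name, config, and subqueries are illustrative, not the platform's real code.

```python
# Hypothetical join-condition config, keyed by strategy pattern.
JOIN_CONDITIONS = {
    "post+all_comment+all_reply": {
        "all_comment": "all_comment_.post_id = post_.post_id",
        "all_reply": ("all_reply_.post_id = all_comment_.post_id "
                      "and all_comment_.comment_id = all_reply_.parent_id"),
    }
}

def build_joined_sql(pattern, filter_blocks):
    """Chain per-object filter subqueries in the pattern's join order.
    filter_blocks maps each filtered object to its subquery text."""
    objects = pattern.split("+")
    conds = JOIN_CONDITIONS[pattern]
    first = objects[0]
    sql = f"({filter_blocks[first]}) {first}_"  # first block needs no join
    for obj in objects[1:]:
        sql += f"\njoin ({filter_blocks[obj]}) {obj}_ on {conds[obj]}"
    return sql

sql = build_joined_sql("post+all_comment+all_reply", {
    "post": "select * from post where title_hit",
    "all_comment": "select * from comment",
    "all_reply": "select * from reply",
})
print(sql)
```

Adding a new strategy pattern then only requires a new entry in the config, which is what makes the framework extensible.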
```sql
-- real sql example
-- I use dynamic partition here; it's a part of task aggregation.
insert overwrite table xxx partition (partition_id)
select
  post_.post_id as case_id,
  map_concat(post_.hit_text_map) as hit_text_map,
  map_concat(post_.hit_word_map) as hit_word_map,
  map_concat(post_.extra_column_map) as extra_map,
  post_.partition_id
from
  (
    select
      post_.*,
      map(
        'post_title_1', title,
        'post_ocr_1', extra_map['ocr']
      ) as hit_text_map,
      map(
        'post_title_1', hit_word_title_1[partition_id],
        'post_ocr_1', hit_word_ocr_1[partition_id]
      ) as hit_word_map,
      map(
        'post',
        to_json(
          named_struct(
            'post_id', post_id,
            'title', title,
            'publish_time', publish_time,
            'user_id', user_id
          )
        )
      ) as extra_column_map,
      partition_id
    from
      (
        select
          *,
          post_title_1_word_match(title) as hit_word_title_1,
          post_ocr_1_word_match(extra_map['ocr']) as hit_word_ocr_1
        from
          yyy
        where
          date = '$v_curr_date'
          and product = 'tiktok'
      ) post_ lateral view explode (
        filter(
          array(
            if(
              hit_word_title_1['partition_id_1'] is not null
              and hit_word_ocr_1['partition_id_1'] is not null,
              'partition_id_1',
              null
            ),
            if(
              hit_word_title_1['partition_id_2'] is not null
              and publish_time >= '2024-01-01',
              'partition_id_2',
              null
            ),
            if(
              hit_word_ocr_1['partition_id_3'] is not null
              and comment_count >= 100,
              'partition_id_3',
              null
            )
          ),
          x -> x is not null
        )
      ) explode_partition_id as partition_id
  ) post_
```
Performance Optimization
📌 "There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." —— Donald Knuth
Storage
Data Split
- Reduce Data Volume: Limit the amount of data involved in querying to improve performance.
- Visibility-Based Split: Separate visible content from non-visible content, as non-visible content may often be irrelevant.
- Time-Based Split: Divide data by publish time, focusing on recent data when needed.
- Trade-Off: While data duplication may occur, it is generally acceptable given the performance benefits.
Bucket
- Point Query Acceleration: Improve the speed of point queries.
- Join Shuffle Reduction: Eliminate or significantly reduce shuffle time in join operations.
- Data Bucketing: Ensures uniform distribution of data, effectively preventing data skew.
- Additional Benefits: Data within each bucket is ordered and compressed efficiently, saving storage space.
- Trade-Off: Writing to Hudi tables introduces an extra shuffle layer, but in strategy recall scenarios, read operations are more common.
| Entity | Bucket Key | Why |
| --- | --- | --- |
| Post | post_id | Primary key |
| Comment | post_id | Visible Post + Visible Comment join; for Comment + Reply, add post_id as join key and utilize bucket superset join |
| User | user_id | Primary key |
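The point-query benefit of bucketing can be seen in a toy model: with N buckets on `post_id`, a point query only scans 1/N of the data. The modulo function below is a stand-in for the engine-specific bucket hash (Hive/Hudi use their own hashing).

```python
NUM_BUCKETS = 1024

def bucket_of(post_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for the engine's bucket hash function.
    return post_id % num_buckets

# Distribute rows into buckets, as the table writer would.
rows = [(pid, f"title_{pid}") for pid in range(10_000)]
buckets = {}
for pid, title in rows:
    buckets.setdefault(bucket_of(pid), []).append((pid, title))

# Point query: only one bucket's worth of data is scanned.
target = 4242
hits = [r for r in buckets[bucket_of(target)] if r[0] == target]
assert hits == [(4242, "title_4242")]
assert len(buckets[bucket_of(target)]) < len(rows)  # far less data touched
```

The same property removes the shuffle in joins: when both sides are bucketed on the join key with compatible bucket counts, matching keys already live in corresponding buckets.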
UDF
Simplify UDF
- UDF Efficiency: Python UDFs are inherently less efficient than native Spark functions.
- UDF Simplicity: Keep UDFs simple, focusing only on essential tasks. For example, Matcher should only perform matching.
- Operator Handling: Implement other operators (AND, OR, NOT, >=, REGEXP, etc.) directly in Spark SQL.
Python
- Faster packages: `orjson`, `msgspec`, `rapidfuzz`, `ahocorasick_rs`.
Replace Python with Java + Rust
- Java: Avoids data serialization overhead between JVM and Python programs.
- Rust: Offers performance comparable to C++, enhancing speed and efficiency.
- Matcher: https://github.com/Lips7/Matcher
Word List Aggregation
- Maximize Rust Matcher Use: Leverage the Rust matcher to its fullest extent.
- Aggregation Process: Combine word lists from various strategies and tasks into a single integrated input for the matcher, then expand it as needed.
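The aggregation step can be sketched as follows: deduplicate the word lists of all pending tasks into one matcher input, match each text once, then fan the hits back out per task. The naive substring scan here is only a stand-in for the Rust Aho-Corasick matcher; task names and data are illustrative.

```python
from collections import defaultdict

# Hypothetical pending tasks, each with its own word list.
tasks = {
    "task_1": ["hello", "spam"],
    "task_2": ["spam", "crypto"],
}

# 1. Aggregate: one deduplicated input for the matcher,
#    remembering which tasks own each word.
word_to_tasks = defaultdict(set)
for task, words in tasks.items():
    for w in words:
        word_to_tasks[w].add(task)

def match_all(text):
    """Match the combined word list once, then expand hits per task."""
    hits_by_task = defaultdict(set)
    for word, owners in word_to_tasks.items():
        if word in text:  # the real matcher runs all words in one pass
            for task in owners:
                hits_by_task[task].add(word)
    return dict(hits_by_task)

hits = match_all("free crypto spam inside")
assert hits == {"task_1": {"spam"}, "task_2": {"spam", "crypto"}}
```

Shared words ("spam" here) are matched once regardless of how many strategies use them, which is where the savings come from.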
Spark SQL
Spark SQL
- Project Pushdown: Retrieve only the necessary columns to reduce I/O overhead.
- Filter Pushdown: Apply filters as early as possible to minimize the amount of data processed.
- ParquetVectorizedReader: Utilize vectorized technology to read Parquet files more efficiently.
- MergeGetMultiMapValue: Read multiple values from a map in a single operation to enhance performance.
- Cost-Based Optimization: Reorder the join sequence for better SQL performance.
Arrow
- Serialization/Deserialization Acceleration: Use Arrow to speed up the process of serialization and deserialization between JVM and Python programs.
- Configuration: Increase `spark.sql.execution.arrow.maxRecordsPerBatch` to optimize performance by processing larger batches.
- Trade-Off: If the value is too large, OOM may occur.
Further Optimization
Composite Text Features
- Feature Abstraction: Allows for the creation of composite features such as `title_content = concat(title, content)` and `concat(asr, ocr)`. Building a single UDF is generally faster than constructing two separate UDFs.
- Trade-Off: Logical word matching may result in false positives.
Task Aggregation
- Efficient Processing: Read data once and compute results for multiple different tasks to optimize performance and reduce overall redundancy.
- Implementation: Create a 5-minute time window during which tasks are partitioned by product, strategy pattern, and partition. These are then aggregated into a single Spark task and submitted to YARN.
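The grouping inside the window can be sketched like this: tasks that share a product and strategy pattern collapse into one Spark job that reads the data once, with each task becoming a dynamic partition of the output. Field names are illustrative.

```python
from collections import defaultdict

# Hypothetical tasks arriving within one 5-minute window.
window = [
    {"task_id": 1, "product": "tiktok", "pattern": "post"},
    {"task_id": 2, "product": "tiktok", "pattern": "post"},
    {"task_id": 3, "product": "tiktok", "pattern": "post+all_comment"},
]

# Group by (product, strategy pattern): each group is one Spark job.
jobs = defaultdict(list)
for task in window:
    jobs[(task["product"], task["pattern"])].append(task["task_id"])

# Two Spark jobs instead of three; tasks 1 and 2 share a single scan,
# each becoming one partition_id in the dynamic-partition insert.
assert jobs[("tiktok", "post")] == [1, 2]
assert len(jobs) == 2
```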
Go further and beyond
- Java UDF Integration: By implementing Java UDFs, we can leverage faster engines like Presto for executing SQL queries.
- Pre-Compute Text Transformations: Since text processing accounts for 99% of our tasks, we can pre-compute transformations (such as converting Traditional Chinese to Simplified Chinese, text deletion, and normalization) in advance.
- Dynamic Time Window: Instead of using a fixed 5-minute window, design a dynamic window that triggers aggregation immediately when the number of tasks reaches a threshold. Otherwise, wait for the 5-minute interval.
- Persistent Spark Driver: Submit tasks to the same Spark driver rather than starting a new Spark driver for each aggregated task.
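The dynamic time window idea can be sketched as a small buffer that flushes as soon as the pending task count reaches a threshold, and otherwise waits out the fixed interval. Class name, threshold, and the `submit` callback are illustrative, not an existing implementation.

```python
import time

class DynamicWindow:
    """Flush when the task count hits a threshold, or when the
    fixed interval elapses, whichever comes first."""

    def __init__(self, submit, threshold=50, interval_s=300):
        self.submit = submit          # callback receiving a task batch
        self.threshold = threshold
        self.interval_s = interval_s
        self.pending = []
        self.opened_at = time.monotonic()

    def add(self, task):
        self.pending.append(task)
        if len(self.pending) >= self.threshold:
            self.flush()              # threshold reached: don't wait

    def tick(self):
        # Called periodically by a scheduler to enforce the interval.
        if self.pending and time.monotonic() - self.opened_at >= self.interval_s:
            self.flush()

    def flush(self):
        self.submit(self.pending)
        self.pending = []
        self.opened_at = time.monotonic()

submitted = []
w = DynamicWindow(submitted.append, threshold=3, interval_s=300)
for t in ("a", "b", "c"):
    w.add(t)
assert submitted == [["a", "b", "c"]]  # flushed at the threshold, no 5-minute wait
```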
Operation
Target
The goal is to improve the retrieval process, focusing on Precision Rate and Recall Rate. Our query is defined as a `strategy`, which primarily consists of `word list`s. Additionally, the retrieval process involves `material` search for video/image/audio similarity using vector search, and `model`s trained by data scientists. Since the focus here is on `word list`, we will only discuss this aspect.
Precision Rate
- Word List: We provide various methods for word matching. For more details, please refer to https://github.com/Lips7/Matcher.
  - Simple Matching: Supports exact and logical matches. It only matches when the text exactly contains the word. For example, `hello&world&hello` will match `hellohelloworld` and `worldhellohello`, but not `helloworld`, due to the repeated occurrence of `hello`.
  - Regex Matching:
    - SimilarChar: Matches similar characters using regex. Given the many variations of characters and words, simple word lists cannot cover all possibilities. For example, `["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"]` will match `helloworld!`, `hollowrd?`, and `hi🌍~`. A simple word list would require 4 x 4 x 3 = 48 entries to cover all cases.
    - Acrostic: Matches acrostic poems using regex. Although less common, this method is valid.
    - Regex: Supports general regex matching without additional explanation.
  - Similar Matching:
    - Levenshtein: Matches similar words using Levenshtein edit distance, which accounts for variations in character edits.
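The repeated-occurrence semantics of the `&`-joined logical match can be sketched in a few lines. This is only an illustration of the behavior described above, not the Matcher's actual (Rust) implementation.

```python
from collections import Counter

def simple_match(pattern: str, text: str) -> bool:
    """'&'-joined logical match: each word must occur in the text at
    least as many times as it appears in the pattern."""
    required = Counter(pattern.split("&"))
    return all(text.count(word) >= n for word, n in required.items())

# "hello" appears twice in the pattern, so it must appear twice in the text.
assert simple_match("hello&world&hello", "hellohelloworld")
assert simple_match("hello&world&hello", "worldhellohello")
assert not simple_match("hello&world&hello", "helloworld")  # only one "hello"
```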
Recall Rate
- Word List: We offer various text transformations when using word list match.
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: `蟲艸` -> `虫艹`.
  - Delete: Remove specific characters. Example: `Fu&*iii&^%%*&kkkk` -> `Fuiiikkkk`.
  - Normalize: Normalize special characters to identifiable characters. Example: `𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟!` -> `hello world!`.
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: `西安` -> `xi an`, which matches `洗按` -> `xi an`, but not `先` -> `xian`.
  - PinYinChar: Convert Chinese characters to Pinyin characters. Example: `西安` -> `xian`, which matches both `洗按` and `先` -> `xian`.
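Two of these transformations are easy to sketch with the Python standard library; the real platform implements them (and the rest, such as Fanjian and PinYin) in Rust with much fuller character tables, so this is a simplified illustration only.

```python
import string
import unicodedata

KEEP = set(string.ascii_letters + string.digits)

def delete_noise(text: str) -> str:
    """Delete: drop characters outside a kept set (here: ASCII alphanumerics)."""
    return "".join(ch for ch in text if ch in KEEP)

def normalize(text: str) -> str:
    """Normalize: NFKC folds many stylized Unicode letters
    (fullwidth, circled, etc.) to their plain forms, then casefold."""
    return unicodedata.normalize("NFKC", text).casefold()

assert delete_noise("Fu&*iii&^%%*&kkkk") == "Fuiiikkkk"
assert normalize("Ｈｅｌｌｏ Ⓦorld!") == "hello world!"
```

Note that NFKC alone does not cover every case in the article's example (some mathematical-alphabet glyphs normalize to Greek letters, not Latin ones), which is exactly why a dedicated normalization table is used in practice.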
Business Intelligence
We provide Business Intelligence (BI) tools to users for detailed analysis of word, material, model, and strategy matches. Key features include:
- Match Details: View detailed information on how each word, material, model, and strategy performs.
- Statistics: Access useful statistics such as the recall count of each task or strategy, the precision rate based on user marks (if provided), and the distribution of feature categories, for example, `view_count < 100`, `100 <= view_count < 1000`, and `1000 <= view_count`.
- Auto Word Recommendation: Receive recommendations for words based on retrieved text.
Summary
In a nutshell, this is a general and highly abstracted framework. Although it was implemented using content data, it is not limited to this domain; it’s applicable to various types of data. In fact, I later successfully applied this framework to a content moderation system, helping moderators sample data efficiently. Even if you only have one product, this framework can still be useful.
It empowers users with limited technical skills to retrieve data easily, leveraging their business knowledge. It enriches the variety of strategies available and inspires user creativity. This is the essence of the Strategy Recall Platform.