In the era of big data, managing data efficiently has become critical for businesses seeking to optimize storage costs and improve system performance. For the GBase database, audit logs provide valuable insight into data usage patterns, enabling a structured approach to classifying data as cold, warm, or hot. This article presents a practical method for leveraging audit logs to classify tables by their access history, offering actionable guidance for lifecycle management and resource optimization.
Abstract
Data lifecycle management is a common issue in projects. Typically, the solution involves:
- Determining the frequency of table usage based on their creation and last accessed times.
- Classifying tables into cold, warm, and hot categories to facilitate data cleanup or migration.
However, for tables holding large amounts of data, this information cannot be retrieved quickly or directly from the cluster metadata. A programmatic approach is therefore needed to extract and analyze multidimensional metadata, separating frequently used tables from rarely used ones for classification and analysis, which ultimately resolves the problem.
Analysis Approach
In the database, the `table_list` field of the `audit_log` table records the tables involved in each SQL statement. Each row corresponds to one SQL query and lists the tables, separated by commas. To extract the table names and access times, a single-row to multi-row transformation is required, which is essentially the reverse of the `GROUP_CONCAT` function.
Audit Log Example: a raw `table_list` value looks something like ``READ: `gbaseods`.`t1` `gbaseods`.`t2`;``, that is, an operation prefix (`WRITE:`, `READ:`, or `OTHER:`), backquoted table names, and a trailing semicolon (the names here are illustrative).
Desired Format After Transformation: one row per table, each paired with the `start_time` of the statement that accessed it.
Expected Outcome: a deduplicated list of tables, each with its most recent access time, from which usage heat can be judged.
Analysis of Challenges
Requirement: Transform multi-table strings in rows into multiple rows of data.
Initial Attempt: The previous solution classified the `table_list` field into single-table and multi-table data. Multi-table data was exported, processed with the `sed` command to replace commas with newlines, and then reloaded into the database. However, this approach involved operations outside the database, making it less efficient than the current in-database solution.
Inspiration: The idea stemmed from Python's `split()` function for string splitting, which suggested using the `SUBSTRING_INDEX` function for the necessary string segmentation.
Technical Reference: The logic of creating a Cartesian product with a cross join was borrowed to transform a single row into multiple rows, fundamentally addressing the problem (see the sketch after this list).
Innovative Approach: This solution pioneers the use of audit-log analysis to understand data usage trends, offering a practical reference for lifecycle management of customer data.
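To make the split-by-cross-join idea concrete, here is a minimal standalone sketch (the string 'a,b,c' and the inline numbers table are illustrative, not part of the actual solution):
-- Split 'a,b,c' into one row per element: the numbers 1..3 index the
-- elements, and the nested SUBSTRING_INDEX calls extract the n-th item.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('a,b,c', ',', n.num), ',', -1) AS elem
FROM (SELECT 1 AS num UNION ALL SELECT 2 UNION ALL SELECT 3) AS n
WHERE n.num <= LENGTH('a,b,c') - LENGTH(REPLACE('a,b,c', ',', '')) + 1;
-- returns three rows: 'a', 'b', 'c'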
Solution
1) Create a Parameter Table:
A table, `test.splist_bryan`, is created to serve as a pseudo-parameter table for the Cartesian product in the cross join.
CREATE TABLE test.splist_bryan (
`id` int(19) NOT NULL,
`num` int(19) DEFAULT NULL
);
2) Initialize the Parameter Table:
A stored procedure inserts 100 rows as parameters.
(Assumes no SQL references more than 100 tables. This limit can be adjusted if needed.)
DELIMITER //
CREATE PROCEDURE test.splist_init()
BEGIN
DECLARE num INT;
SET num = 1;
WHILE num <= 100 DO
INSERT INTO test.splist_bryan (id, num) VALUES (num, num);
SET num = num + 1;
END WHILE;
END //
DELIMITER ;
CALL test.splist_init();
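Optionally, a quick sanity check confirms the parameter table is populated:
SELECT COUNT(*) FROM test.splist_bryan;
-- expected: 100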
3) Create the `test1` Table:
Summarize `audit_log` data by classifying and grouping.
(Here, `audit_log` is a combined Express engine table. Cross-engine data migration requires enabling `_gbase_query_path`.)
CREATE TABLE test.test1 (start_time TIMESTAMP, tbl VARCHAR(5000));
SET GLOBAL _gbase_query_path = 1;
INSERT INTO test.test1
SELECT MAX(start_time) AS start_time,
REPLACE(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', '')), ' ', ',') AS tbl
FROM gbase.audit_log
WHERE LENGTH(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', ''))) <> 0
AND NOT REGEXP_LIKE(db, 'gbase|information_schema|performance_schema|gctmpdb|gclusterdb', 'i')
GROUP BY REPLACE(TRIM(REGEXP_REPLACE(table_list, ';|`|WRITE:|READ:|OTHER:', '')), ' ', ',');
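To see what the cleanup expression produces, here is an illustrative call on a single literal value (the table names are made up):
-- Strip the operation prefix, backticks, and semicolon, then turn the
-- space-separated list into a comma-separated one.
SELECT REPLACE(TRIM(REGEXP_REPLACE(
           'READ: `gbaseods`.`t1` `gbaseods`.`t2`;',
           ';|`|WRITE:|READ:|OTHER:', '')), ' ', ',') AS tbl;
-- returns: gbaseods.t1,gbaseods.t2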
Field Descriptions for the `audit_log` Table:
| Field | Description |
| --- | --- |
| `thread_id` | Thread ID, same as the ID in `processlist`. |
| `taskid` | Globally unique task ID. |
| `start_time` | SQL execution start time. |
| `end_time` | SQL execution end time. |
| `user_host` | Logged-in username and host. |
| `query_time` | Execution duration. |
| `rows` | Number of rows returned in the result set. |
| `db` | Current database in use. |
| `table_list` | Tables involved, in the format `<db>.<tb>, ...`. |
| `sql_type` | SQL type (DDL, DML, DQL, etc.). |
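Before transforming anything, it can help to eyeball the raw log (a simple sketch using the fields above):
SELECT start_time, db, table_list
FROM gbase.audit_log
ORDER BY start_time DESC
LIMIT 5;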
4) Transform Rows into Multiple Rows (`test2`):
Use a cross join for the Cartesian product and `SUBSTRING_INDEX` for segmentation. Filter out system tables.
CREATE TABLE test.test2 AS
SELECT * FROM (
    -- Keep the latest access time for each distinct table name.
    SELECT MAX(start_time) AS mydate, col AS tblist
    FROM (
        -- Pair each row with the numbers 1..N (N = table count in the row)
        -- and extract the n-th comma-separated table name.
        SELECT t.start_time,
               SUBSTRING_INDEX(SUBSTRING_INDEX(t.tbl, ',', b.num), ',', -1) AS col
        FROM test.test1 AS t
        CROSS JOIN test.splist_bryan AS b
            ON b.num <= LENGTH(t.tbl) - LENGTH(REPLACE(t.tbl, ',', '')) + 1
    ) AS m
    GROUP BY 2
) AS p
WHERE NOT REGEXP_LIKE(tblist, 'gbase|information_schema|performance_schema|gctmpdb|gclusterdb', 'i');
5) Analysis of the Results
The results in the `test2` table show the most recent usage date for each table. Each table occupies one row with no duplicate entries, allowing heat analysis based on usage dates.
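To scan the results, the least recently used tables can be listed first (a sketch over the `test2` columns defined above):
SELECT mydate, tblist
FROM test.test2
ORDER BY mydate ASC   -- oldest access first
LIMIT 20;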
Cold, Warm, and Hot Data Analysis
1) Overview of Cold, Warm, and Hot Data
Using the telecommunications industry as an example:
- Data from within 1 year is considered hot.
- Data between 1 and 3 years old is warm.
- Data older than 3 years is classified as cold.
The definitions of cold, warm, and hot data can be customized based on the actual data distribution.
Example:
CREATE TABLE test.test3 AS
SELECT mydate, tblist,
       SUBSTRING(tblist, 1, INSTR(tblist, '.') - 1) AS dbname,
       SUBSTRING(tblist, INSTR(tblist, '.') + 1) AS tbname,
       CASE WHEN diff <= 365 THEN 'hot'                     -- within 1 year
            WHEN diff > 365 AND diff <= 3 * 365 THEN 'warm' -- 1 to 3 years
            ELSE 'cold' END AS class                        -- older than 3 years
FROM (SELECT DATE(mydate) AS mydate,
             DATEDIFF(SYSDATE(), DATE(mydate)) AS diff,
             tblist
      FROM test.test2) AS p;
Details for Each Table Type (Partial Example):
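A query along the following lines (a sketch over the `test.test3` table created above) yields such a sample:
SELECT mydate, dbname, tbname, class
FROM test.test3
WHERE class = 'cold'
LIMIT 10;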
Total Counts for Each Type:
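The per-class totals come from a simple aggregate over the same table:
SELECT class, COUNT(*) AS tbl_cnt
FROM test.test3
GROUP BY class;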
2) Database-Level Cold, Warm, and Hot Data Analysis
Sometimes it is necessary to identify which databases contain the most cold data in order to prioritize cleanup or migration efforts. The following query provides a per-database breakdown:
SELECT class, dbname, COUNT(tblist) AS tbl_cnt FROM test.test3 GROUP BY class, dbname;
Example Query:
SELECT dbname,
       SUM(DECODE(class, 'cold', 1, 0)) AS cold_cnt,
       SUM(DECODE(class, 'warm', 1, 0)) AS warm_cnt,
       SUM(DECODE(class, 'hot', 1, 0))  AS hot_cnt
FROM test.test3
GROUP BY dbname;
From the results, the databases `gbaserpt`, `gbaseods`, `gbasedwd`, and `gbasemsm` hold the largest amounts of cold data and should be prioritized for data migration or cleanup.
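`DECODE` is an Oracle-style convenience; if it is unavailable in a given environment, the standard `CASE` form is equivalent:
SELECT dbname,
       SUM(CASE WHEN class = 'cold' THEN 1 ELSE 0 END) AS cold_cnt,
       SUM(CASE WHEN class = 'warm' THEN 1 ELSE 0 END) AS warm_cnt,
       SUM(CASE WHEN class = 'hot'  THEN 1 ELSE 0 END) AS hot_cnt
FROM test.test3
GROUP BY dbname;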
By effectively analyzing audit logs in the GBase database, organizations can uncover meaningful patterns in data usage, aiding the classification of cold, warm, and hot data. This classification empowers data administrators to make informed decisions about storage optimization, prioritize cleanup or migration efforts, and improve overall system efficiency. With the methodology outlined in this article, businesses can establish a solid foundation for data lifecycle management and sustainable, cost-effective data strategies.