SQLBoy

基于数据分析与离线看板场景,SQL的用途主要分以下几类:

SQL查询逻辑(逻辑查询处理的步骤序号)

1
2
3
4
5
6
7
8
9
10
(8) SELECT (9) DISTINCT<select_list>
(1) FROM <left_table>
(3) <join_type> JOIN <right_table>
(2) ON <join_condition>
(4) WHERE <where_condition>
(5) GROUP BY <group_by_list>
(6) WITH {CUBE|ROLLUP}
(7) HAVING <having_condition>
(10) ORDER BY <order_by_list>
(11) LIMIT <limit_number>

类型 说明 典型SQL操作 应用场景
数据查询类 从表中提取数据,用于临时分析或看板展示 SELECT, WHERE, ORDER BY, LIMIT 临时查询、探索性分析
数据聚合类 对数据进行汇总统计,生成指标 GROUP BY, SUM, COUNT, AVG 日报、月报、KPI看板
数据清洗类 清洗、转换原始数据,提升数据质量 CASE WHEN, COALESCE, TRIM, CAST ETL流程、数据预处理
数据关联类 多表连接,构建宽表或关联分析 JOIN (INNER/LEFT/RIGHT), UNION 用户行为分析、维度打宽
窗口分析类 基于时间或分组进行高级分析 ROW_NUMBER(), RANK(), LEAD/LAG 用户留存、排名、趋势分析

一、SQL分类

1.0 数据查询类(基础查询)

用途:快速获取原始数据,用于临时分析或简单过滤。

1
2
3
4
5
6
7
8
9
10
-- 查询某日活跃用户
SELECT
user_id,
device_type,
login_time
FROM user_logs
WHERE dt = '2025-12-20'
AND login_time >= '08:00:00'
ORDER BY login_time DESC
LIMIT 100;

优化建议:

对 dt 和 login_time 建立分区/索引
避免无 LIMIT 的全表扫描

1.1 数据聚合类(指标计算)

1
2
3
4
5
6
7
8
9
10
-- 计算每日DAU和GMV
SELECT
dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv,
AVG(order_amount) AS avg_order_value
FROM dwd_orders
WHERE dt BETWEEN '2025-12-01' AND '2025-12-20'
GROUP BY dt
ORDER BY dt;

1.2 数据清洗类(ETL处理)

用途:清洗脏数据、标准化字段、补全缺失值。
(一般在ETL阶段完成清洗,避免在查询时处理)

1
2
3
4
5
6
7
8
9
10
11
-- 清洗用户表(去空、格式标准化)
INSERT OVERWRITE TABLE clean_users
SELECT
user_id,
COALESCE(age, 0) AS age, -- 空值补0
UPPER(TRIM(region)) AS region, -- 去空格并大写
CASE
WHEN device IN ('ios', 'android') THEN device
ELSE 'other'
END AS device_type
FROM raw_users;

1.3 数据关联类(多表JOIN)

构建宽表,关联用户、订单、商品等维度。

1
2
3
4
5
6
7
8
9
10
11
-- 用户订单宽表(关联用户属性)
SELECT
o.order_id,
o.user_id,
u.age_group,
u.membership_level,
o.order_amount,
p.category AS product_category
FROM fact_orders o
LEFT JOIN dim_users u ON o.user_id = u.user_id
LEFT JOIN dim_products p ON o.product_id = p.product_idWHERE o.dt = '2025-12-20';

优化建议:

小表(如 dim_users)广播JOIN(Spark中加 /+ BROADCAST(u) /)
优先过滤 dt 分区

1.4 窗口分析类(高级分析)

用途:计算排名、留存、同比环比等复杂逻辑。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
-- 用户消费金额排名(按地区分区)
SELECT
user_id,
region,
total_spent,
RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS rank
FROM (
SELECT
user_id,
region,
SUM(order_amount) AS total_spent
FROM dwd_orders
WHERE dt >= '2025-12-01'
GROUP BY user_id, region
) t;

-- 计算7日留存
SELECT
a.user_id,
a.dt AS install_date,
b.dt AS active_date,
DATEDIFF(b.dt, a.dt) AS retention_day
FROM user_installs a
LEFT JOIN user_logs b ON a.user_id = b.user_id
AND b.dt BETWEEN a.dt AND DATE_ADD(a.dt, 7)
WHERE a.dt = '2025-12-15';

优化建议:

窗口函数配合 WHERE 限制数据范围
留存分析使用 DATEDIFF 替代笛卡尔积

1.5 数据建模类(构建汇总表)

用途:创建物化表,加速后续查询。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- 创建用户行为宽表
CREATE TABLE dwd_user_behavior_wide AS
SELECT
f.user_id,
d1.age_group,
d2.device_type,
f.event_type,
f.event_timeFROM fact_events f
JOIN dim_user d1 ON f.user_id = d1.user_id
JOIN dim_device d2 ON f.device_id = d2.device_id;

-- 生成月度汇总表
INSERT OVERWRITE TABLE mau_summary
SELECT
DATE_FORMAT(dt, 'yyyy-MM') AS month,
COUNT(DISTINCT user_id) AS mau,
SUM(revenue) AS monthly_revenue
FROM dwd_orders
WHERE dt >= '2025-01-01'
GROUP BY DATE_FORMAT(dt, 'yyyy-MM');

优化建议:

定期调度更新汇总表
使用分区表按时间归档

1.6 调度执行类(参数化SQL)

用途:配合Airflow等工具实现自动化。

1
2
3
4
5
6
7
8
-- 参数化日报(${bizdate} 由调度系统传入)
INSERT INTO daily_report
SELECT
'${bizdate}' AS dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt = '${bizdate}';

优化建议:

使用变量替换硬编码日期
记录任务日志,便于排查失败原因

二、SQL分类v2

2.1 按分析类型划分

2.1.1 描述性分析(What Happened?)

目标:回答“发生了什么”,通过聚合统计描述现状。
典型看板:日报、周报、核心指标监控面板
SQL特点:简单聚合、时间维度分组、低复杂度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- 每日DAU和GMV(日报)
SELECT
dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv,
ROUND(AVG(order_amount), 2) AS avg_order_value
FROM dwd_orders
WHERE dt BETWEEN '2025-12-01' AND '2025-12-20'
GROUP BY dt
ORDER BY dt;

-- 用户地域分布(静态快照)
SELECT
region,
COUNT(user_id) AS user_count
FROM dim_users
WHERE dt = '2025-12-20'
GROUP BY region;

2.1.2 诊断性分析(Why Did It Happen?)

目标:定位问题原因,通过下钻、对比、关联分析。
典型看板:问题归因看板、漏斗分析、多维下钻报表
SQL特点:多表JOIN、条件过滤、子查询

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- 订单下降归因(按渠道+设备下钻)
SELECT
channel,
device_type,
COUNT(order_id) AS order_count,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt >= '2025-12-15'
AND dt <= '2025-12-20'
GROUP BY channel, device_type
ORDER BY gmv DESC;

-- 漏斗转化分析(多步骤JOIN)
WITH step1 AS (
SELECT COUNT(DISTINCT user_id) AS visit_users
FROM user_events
WHERE event_type = 'page_view' AND dt = '2025-12-20'
),
step2 AS (
SELECT COUNT(DISTINCT user_id) AS cart_users
FROM user_events
WHERE event_type = 'add_to_cart' AND dt = '2025-12-20'
)
SELECT
visit_users,
cart_users,
ROUND(cart_users * 100.0 / visit_users, 2) AS conversion_rate
FROM step1, step2;

2.1.3 预测性分析(What Will Happen?)

目标:基于历史数据预测未来趋势(需结合机器学习)。
典型看板:销量预测、用户流失预警
SQL特点:窗口函数、时间序列计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- 移动平均预测(7日平滑)
SELECT
dt,
gmv,
AVG(gmv) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
FROM daily_gmv_summary
WHERE dt >= '2025-11-20';

-- 用户流失预警(历史活跃度对比)
SELECT
user_id,
last_active_date,
DATEDIFF('2025-12-20', last_active_date) AS inactive_days,
CASE
WHEN DATEDIFF('2025-12-20', last_active_date) > 7 THEN 'high_risk'
ELSE 'normal'
END AS churn_risk
FROM user_last_active;

优化建议:

使用物化视图加速时间序列计算
避免全表窗口函数,限制数据范围

2.1.4 规范性分析(How to Act?)

目标:提供行动建议,如A/B测试结果、策略效果评估。
典型看板:实验分析看板、策略ROI看板
SQL特点:条件分支、对比组分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- A/B测试转化率对比
SELECT
test_group,
COUNT(DISTINCT user_id) AS users,
SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) AS conversions,
ROUND(100.0 * SUM(converted) / COUNT(DISTINCT user_id), 2) AS conversion_rate
FROM ab_test_results
WHERE dt = '2025-12-20'
GROUP BY test_group;

-- 优惠券ROI分析
SELECT
coupon_type,
SUM(revenue) AS revenue,
SUM(coupon_cost) AS cost,
ROUND((SUM(revenue) - SUM(coupon_cost)) / SUM(coupon_cost), 2) AS roi
FROM marketing_campaigns
WHERE dt >= '2025-12-01'
GROUP BY coupon_type;

2.2 按看板类型划分

2.2.1 监控型看板(实时监控核心指标)

特点:高频刷新、简单图表(折线图、柱状图)
SQL优化重点:预聚合、低延迟

1
2
3
4
5
6
7
8
9
-- 实时GMV监控(5分钟级)
SELECT
DATE_FORMAT(event_time, 'yyyy-MM-dd HH24:mi') AS time_window,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt = '2025-12-20'
AND event_time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY time_window
ORDER BY time_window;

优化建议:

使用增量更新(如 WHERE event_time > LAST_UPDATE_TIME)
物化视图预计算分钟级聚合

2.2.2 探索型看板(多维下钻分析)

特点:交互式分析、支持筛选/下钻/对比
SQL优化重点:宽表建模、索引优化

1
2
3
4
5
6
7
8
9
10
11
12
-- 用户行为宽表(支持多维筛选)
SELECT
user_id,
age_group,
device_type,
region,
COUNT(event_id) AS event_count,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count
FROM dwd_user_behavior_wide
WHERE dt = '2025-12-20'
AND region IN ('North', 'South') -- 支持前端筛选
GROUP BY user_id, age_group, device_type, region;

优化建议:

构建星型模型(事实表+维度表)
对筛选字段(如 region)建立索引

2.2.3 诊断型看板(问题定位与归因)

特点:多维度对比、下钻路径清晰
SQL优化重点:分层查询、避免重复计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- 订单异常归因(分层下钻)
WITH daily_orders AS (
SELECT
dt,
COUNT(order_id) AS order_count
FROM dwd_orders
WHERE dt BETWEEN '2025-12-15' AND '2025-12-20'
GROUP BY dt
),
region_breakdown AS (
SELECT
region,
COUNT(order_id) AS order_count
FROM dwd_orders
WHERE dt = '2025-12-20'
GROUP BY region
)
SELECT
'Overall' AS level,
dt AS dimension,
order_count
FROM daily_orders
UNION ALL
SELECT
'By Region' AS level,
region AS dimension,
order_count
FROM region_breakdown;

优化建议:

使用 UNION ALL 合并多层级结果
限制下钻深度(避免无限递归)

2.2.4 战略型看板(长期趋势与规划)

特点:宏观指标、长期趋势、对比分析
SQL优化重点:时间序列优化、同比环比

1
2
3
4
5
6
7
8
9
10
11
12
-- 年度GMV趋势(同比)
SELECT
year,
SUM(gmv) AS annual_gmv,
LAG(SUM(gmv), 1) OVER (ORDER BY year) AS prev_year_gmv,
ROUND(
(SUM(gmv) - LAG(SUM(gmv), 1) OVER (ORDER BY year)) * 100.0 /
LAG(SUM(gmv), 1) OVER (ORDER BY year), 2
) AS yoy_growth
FROM monthly_gmv_summary
GROUP BY year
ORDER BY year;

优化建议:

使用 LAG/LEAD 计算同比环比
预计算年度汇总表

三、优化经验

2.1 JOIN优化

优化前:大表JOIN小表,无过滤

1
2
3
SELECT a.*, b.name 
FROM big_table a
LEFT JOIN small_table b ON a.id = b.id; -- 可能触发笛卡尔积

优化后:

1
2
3
SELECT a.*, b.name 
FROM (SELECT * FROM big_table WHERE dt = '2025-12-20') a -- 先过滤
LEFT JOIN small_table b ON a.id = b.id; -- 小表广播JOIN

2.2 数据倾斜处理

1
2
3
4
-- 某用户ID数据量极大,导致Task卡死
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id;

解决方案(加盐):

1
2
3
4
5
SELECT 
CONCAT(user_id, '_', FLOOR(RAND() * 10)) AS salted_key,
SUM(amount)
FROM orders
GROUP BY salted_key;

三、总结

最佳实践:
优先预计算:高频指标用汇总表存储。
分区为王:时间分区是性能基石。
避免黑洞操作:如 SELECT *、无过滤JOIN。
监控慢查询:使用 EXPLAIN 分析执行计划。

分类 关键点 示例工具
查询类 分区过滤、字段精简 Hive/Spark SQL
聚合类 预计算、避免重复扫描 ClickHouse物化视图
清洗类 提前处理,避免运行时计算 dbt
关联类 小表广播、分区JOIN Spark Broadcast Join
窗口类 限制数据范围 Presto/Trino
建模类 定期调度更新 Airflow + SQL
调度类 参数化、日志追踪 DolphinScheduler