SQLBoy

Catalogue
  1. 一、SQL分类
    1. 1.0 数据查询类(基础查询)
    2. 1.1 数据聚合类(指标计算)
    3. 1.2 数据清洗类(ETL处理)
    4. 1.3 数据关联类(多表JOIN)
    5. 1.4 窗口分析类(高级分析)
    6. 1.5 数据建模类(构建汇总表)
    7. 1.6 调度执行类(参数化SQL)
  2. 二、SQL分类v2
    1. 2.1 按分析类型划分
      1. 2.1.1 描述性分析(What Happened?)
      2. 2.1.2 诊断性分析(Why Did It Happen?)
      3. 2.1.3 预测性分析(What Will Happen?)
      4. 2.1.4 规范性分析(How to Act?)
    2. 2.2 按看板类型划分
      1. 2.2.1 监控型看板(实时监控核心指标)
      2. 2.2.2 探索型看板(多维下钻分析)
      3. 2.2.3 诊断型看板(问题定位与归因)
      4. 2.2.4 战略型看板(长期趋势与规划)
  3. 三、优化经验
    1. 2.1 JOIN优化
    2. 2.2 数据倾斜处理
  4. 三、总结

基于数据分析与离线看板场景,SQL的用途主要分以下几类:

SQL查询逻辑(逻辑查询处理的步骤序号)

1
2
3
4
5
6
7
8
9
10
(8) SELECT (9) DISTINCT<select_list>
(1) FROM <left_table>
(3) <join_type> JOIN <right_table>
(2) ON <join_condition>
(4) WHERE <where_condition>
(5) GROUP BY <group_by_list>
(6) WITH {CUBE|ROLLUP}
(7) HAVING <having_condition>
(10) ORDER BY <order_by_list>
(11) LIMIT <limit_number>

类型 说明 典型SQL操作 应用场景
数据查询类 从表中提取数据,用于临时分析或看板展示 SELECT, WHERE, ORDER BY, LIMIT 临时查询、探索性分析
数据聚合类 对数据进行汇总统计,生成指标 GROUP BY, SUM, COUNT, AVG 日报、月报、KPI看板
数据清洗类 清洗、转换原始数据,提升数据质量 CASE WHEN, COALESCE, TRIM, CAST ETL流程、数据预处理
数据关联类 多表连接,构建宽表或关联分析 JOIN (INNER/LEFT/RIGHT), UNION 用户行为分析、维度打宽
窗口分析类 基于时间或分组进行高级分析 ROW_NUMBER(), RANK(), LEAD/LAG 用户留存、排名、趋势分析

一、SQL分类

1.0 数据查询类(基础查询)

用途:快速获取原始数据,用于临时分析或简单过滤。

1
2
3
4
5
6
7
8
9
10
-- 查询某日活跃用户
SELECT
user_id,
device_type,
login_time
FROM user_logs
WHERE dt = '2025-12-20'
AND login_time >= '08:00:00'
ORDER BY login_time DESC
LIMIT 100;

优化建议:

对 dt 和 login_time 建立分区/索引
避免无 LIMIT 的全表扫描

1.1 数据聚合类(指标计算)

1
2
3
4
5
6
7
8
9
10
-- 计算每日DAU和GMV
SELECT
dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv,
AVG(order_amount) AS avg_order_value
FROM dwd_orders
WHERE dt BETWEEN '2025-12-01' AND '2025-12-20'
GROUP BY dt
ORDER BY dt;

1.2 数据清洗类(ETL处理)

用途:清洗脏数据、标准化字段、补全缺失值。
(一般在ETL阶段完成清洗,避免在查询时处理)

1
2
3
4
5
6
7
8
9
10
11
-- 清洗用户表(去空、格式标准化)
INSERT OVERWRITE TABLE clean_users
SELECT
user_id,
COALESCE(age, 0) AS age, -- 空值补0
UPPER(TRIM(region)) AS region, -- 去空格并大写
CASE
WHEN device IN ('ios', 'android') THEN device
ELSE 'other'
END AS device_type
FROM raw_users;

1.3 数据关联类(多表JOIN)

构建宽表,关联用户、订单、商品等维度。

1
2
3
4
5
6
7
8
9
10
11
-- 用户订单宽表(关联用户属性)
SELECT
o.order_id,
o.user_id,
u.age_group,
u.membership_level,
o.order_amount,
p.category AS product_category
FROM fact_orders o
LEFT JOIN dim_users u ON o.user_id = u.user_id
LEFT JOIN dim_products p ON o.product_id = p.product_idWHERE o.dt = '2025-12-20';

优化建议:

小表(如 dim_users)广播JOIN(Spark中加 /+ BROADCAST(u) /)
优先过滤 dt 分区

1.4 窗口分析类(高级分析)

用途:计算排名、留存、同比环比等复杂逻辑。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
-- 用户消费金额排名(按地区分区)
SELECT
user_id,
region,
total_spent,
RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS rank
FROM (
SELECT
user_id,
region,
SUM(order_amount) AS total_spent
FROM dwd_orders
WHERE dt >= '2025-12-01'
GROUP BY user_id, region
) t;

-- 计算7日留存
SELECT
a.user_id,
a.dt AS install_date,
b.dt AS active_date,
DATEDIFF(b.dt, a.dt) AS retention_day
FROM user_installs a
LEFT JOIN user_logs b ON a.user_id = b.user_id
AND b.dt BETWEEN a.dt AND DATE_ADD(a.dt, 7)
WHERE a.dt = '2025-12-15';

优化建议:

窗口函数配合 WHERE 限制数据范围
留存分析使用 DATEDIFF 替代笛卡尔积

1.5 数据建模类(构建汇总表)

用途:创建物化表,加速后续查询。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- 创建用户行为宽表
CREATE TABLE dwd_user_behavior_wide AS
SELECT
f.user_id,
d1.age_group,
d2.device_type,
f.event_type,
f.event_timeFROM fact_events f
JOIN dim_user d1 ON f.user_id = d1.user_id
JOIN dim_device d2 ON f.device_id = d2.device_id;

-- 生成月度汇总表
INSERT OVERWRITE TABLE mau_summary
SELECT
DATE_FORMAT(dt, 'yyyy-MM') AS month,
COUNT(DISTINCT user_id) AS mau,
SUM(revenue) AS monthly_revenue
FROM dwd_orders
WHERE dt >= '2025-01-01'
GROUP BY DATE_FORMAT(dt, 'yyyy-MM');

优化建议:

定期调度更新汇总表
使用分区表按时间归档

1.6 调度执行类(参数化SQL)

用途:配合Airflow等工具实现自动化。

1
2
3
4
5
6
7
8
-- 参数化日报(${bizdate} 由调度系统传入)
INSERT INTO daily_report
SELECT
'${bizdate}' AS dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt = '${bizdate}';

优化建议:

使用变量替换硬编码日期
记录任务日志,便于排查失败原因

二、SQL分类v2

2.1 按分析类型划分

2.1.1 描述性分析(What Happened?)

目标:回答“发生了什么”,通过聚合统计描述现状。
典型看板:日报、周报、核心指标监控面板
SQL特点:简单聚合、时间维度分组、低复杂度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- 每日DAU和GMV(日报)
SELECT
dt,
COUNT(DISTINCT user_id) AS dau,
SUM(order_amount) AS gmv,
ROUND(AVG(order_amount), 2) AS avg_order_value
FROM dwd_orders
WHERE dt BETWEEN '2025-12-01' AND '2025-12-20'
GROUP BY dt
ORDER BY dt;

-- 用户地域分布(静态快照)
SELECT
region,
COUNT(user_id) AS user_count
FROM dim_users
WHERE dt = '2025-12-20'
GROUP BY region;

2.1.2 诊断性分析(Why Did It Happen?)

目标:定位问题原因,通过下钻、对比、关联分析。
典型看板:问题归因看板、漏斗分析、多维下钻报表
SQL特点:多表JOIN、条件过滤、子查询

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- 订单下降归因(按渠道+设备下钻)
SELECT
channel,
device_type,
COUNT(order_id) AS order_count,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt >= '2025-12-15'
AND dt <= '2025-12-20'
GROUP BY channel, device_type
ORDER BY gmv DESC;

-- 漏斗转化分析(多步骤JOIN)
WITH step1 AS (
SELECT COUNT(DISTINCT user_id) AS visit_users
FROM user_events
WHERE event_type = 'page_view' AND dt = '2025-12-20'
),
step2 AS (
SELECT COUNT(DISTINCT user_id) AS cart_users
FROM user_events
WHERE event_type = 'add_to_cart' AND dt = '2025-12-20'
)
SELECT
visit_users,
cart_users,
ROUND(cart_users * 100.0 / visit_users, 2) AS conversion_rate
FROM step1, step2;

2.1.3 预测性分析(What Will Happen?)

目标:基于历史数据预测未来趋势(需结合机器学习)。
典型看板:销量预测、用户流失预警
SQL特点:窗口函数、时间序列计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- 移动平均预测(7日平滑)
SELECT
dt,
gmv,
AVG(gmv) OVER (ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
FROM daily_gmv_summary
WHERE dt >= '2025-11-20';

-- 用户流失预警(历史活跃度对比)
SELECT
user_id,
last_active_date,
DATEDIFF('2025-12-20', last_active_date) AS inactive_days,
CASE
WHEN DATEDIFF('2025-12-20', last_active_date) > 7 THEN 'high_risk'
ELSE 'normal'
END AS churn_risk
FROM user_last_active;

优化建议:

使用物化视图加速时间序列计算
避免全表窗口函数,限制数据范围

2.1.4 规范性分析(How to Act?)

目标:提供行动建议,如A/B测试结果、策略效果评估。
典型看板:实验分析看板、策略ROI看板
SQL特点:条件分支、对比组分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- A/B测试转化率对比
SELECT
test_group,
COUNT(DISTINCT user_id) AS users,
SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) AS conversions,
ROUND(100.0 * SUM(converted) / COUNT(DISTINCT user_id), 2) AS conversion_rate
FROM ab_test_results
WHERE dt = '2025-12-20'
GROUP BY test_group;

-- 优惠券ROI分析
SELECT
coupon_type,
SUM(revenue) AS revenue,
SUM(coupon_cost) AS cost,
ROUND((SUM(revenue) - SUM(coupon_cost)) / SUM(coupon_cost), 2) AS roi
FROM marketing_campaigns
WHERE dt >= '2025-12-01'
GROUP BY coupon_type;

2.2 按看板类型划分

2.2.1 监控型看板(实时监控核心指标)

特点:高频刷新、简单图表(折线图、柱状图)
SQL优化重点:预聚合、低延迟

1
2
3
4
5
6
7
8
9
-- 实时GMV监控(5分钟级)
SELECT
DATE_FORMAT(event_time, 'yyyy-MM-dd HH24:mi') AS time_window,
SUM(order_amount) AS gmv
FROM dwd_orders
WHERE dt = '2025-12-20'
AND event_time >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY time_window
ORDER BY time_window;

优化建议:

使用增量更新(如 WHERE event_time > LAST_UPDATE_TIME)
物化视图预计算分钟级聚合

2.2.2 探索型看板(多维下钻分析)

特点:交互式分析、支持筛选/下钻/对比
SQL优化重点:宽表建模、索引优化

1
2
3
4
5
6
7
8
9
10
11
12
-- 用户行为宽表(支持多维筛选)
SELECT
user_id,
age_group,
device_type,
region,
COUNT(event_id) AS event_count,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count
FROM dwd_user_behavior_wide
WHERE dt = '2025-12-20'
AND region IN ('North', 'South') -- 支持前端筛选
GROUP BY user_id, age_group, device_type, region;

优化建议:

构建星型模型(事实表+维度表)
对筛选字段(如 region)建立索引

2.2.3 诊断型看板(问题定位与归因)

特点:多维度对比、下钻路径清晰
SQL优化重点:分层查询、避免重复计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- 订单异常归因(分层下钻)
WITH daily_orders AS (
SELECT
dt,
COUNT(order_id) AS order_count
FROM dwd_orders
WHERE dt BETWEEN '2025-12-15' AND '2025-12-20'
GROUP BY dt
),
region_breakdown AS (
SELECT
region,
COUNT(order_id) AS order_count
FROM dwd_orders
WHERE dt = '2025-12-20'
GROUP BY region
)
SELECT
'Overall' AS level,
dt AS dimension,
order_count
FROM daily_orders
UNION ALL
SELECT
'By Region' AS level,
region AS dimension,
order_count
FROM region_breakdown;

优化建议:

使用 UNION ALL 合并多层级结果
限制下钻深度(避免无限递归)

2.2.4 战略型看板(长期趋势与规划)

特点:宏观指标、长期趋势、对比分析
SQL优化重点:时间序列优化、同比环比

1
2
3
4
5
6
7
8
9
10
11
12
-- 年度GMV趋势(同比)
SELECT
year,
SUM(gmv) AS annual_gmv,
LAG(SUM(gmv), 1) OVER (ORDER BY year) AS prev_year_gmv,
ROUND(
(SUM(gmv) - LAG(SUM(gmv), 1) OVER (ORDER BY year)) * 100.0 /
LAG(SUM(gmv), 1) OVER (ORDER BY year), 2
) AS yoy_growth
FROM monthly_gmv_summary
GROUP BY year
ORDER BY year;

优化建议:

使用 LAG/LEAD 计算同比环比
预计算年度汇总表

三、优化经验

2.1 JOIN优化

优化前:大表JOIN小表,无过滤

1
2
3
SELECT a.*, b.name 
FROM big_table a
LEFT JOIN small_table b ON a.id = b.id; -- 可能触发笛卡尔积

优化后:

1
2
3
SELECT a.*, b.name 
FROM (SELECT * FROM big_table WHERE dt = '2025-12-20') a -- 先过滤
LEFT JOIN small_table b ON a.id = b.id; -- 小表广播JOIN

2.2 数据倾斜处理

1
2
3
4
-- 某用户ID数据量极大,导致Task卡死
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id;

解决方案(加盐):

1
2
3
4
5
SELECT 
CONCAT(user_id, '_', FLOOR(RAND() * 10)) AS salted_key,
SUM(amount)
FROM orders
GROUP BY salted_key;

三、总结

最佳实践:
优先预计算:高频指标用汇总表存储。
分区为王:时间分区是性能基石。
避免黑洞操作:如 SELECT *、无过滤JOIN。
监控慢查询:使用 EXPLAIN 分析执行计划。

分类 关键点 示例工具
查询类 分区过滤、字段精简 Hive/Spark SQL
聚合类 预计算、避免重复扫描 ClickHouse物化视图
清洗类 提前处理,避免运行时计算 dbt
关联类 小表广播、分区JOIN Spark Broadcast Join
窗口类 限制数据范围 Presto/Trino
建模类 定期调度更新 Airflow + SQL
调度类 参数化、日志追踪 DolphinScheduler