高频SQL题

Catalogue
  1. 1.每日UV
  2. 2.最长连续登录次数
  3. 3.拆分并统计标签字段
  4. 4. 每天浏览量的TOP
  5. 5. 每日累计充值金额
  6. 6. 用户首购信息
  7. 7.每月活跃用户(MAU)
  8. 8. 会话切割
  9. 9.JSON字段解析
  10. 10.半连接查询
  11. 11.金额区间分桶
  12. 12.用户留存率计算
  13. 13.日活跃用户增长率
  14. 14.交叉标签共线
  15. 15.用户分群(K-means前处理)
  16. 16.多级分类销量汇总
  17. 17.事件类型分布占比
  18. 18.环比和同比
  19. 19.设备多维度统计

1.每日UV

业务意义:衡量平台用户规模、吸引力的核心指标。 核心价值:1. 追踪用户增长趋势,判断平台用户基数变化,2.评估营销活动帮效果。3.为商业推广、运营策略调整提供数据支撑。

2.最长连续登录次数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
SELECT
user_id,
MAX(date_seg) AS max_continuous_days, -- 别名优化:最长单次连续登录天数
SUM(date_seg) AS total_continuous_days -- 别名优化:所有连续段总连续登录天数
FROM (
-- 子查询c:统计每个用户每个连续登录段的天数(单个连续段的长度)
SELECT
user_id,
COUNT(date_group) AS date_seg -- date_group相同为同一个连续段,计数即为该段连续天数
FROM (
-- 子查询b:为每个用户的每日登录记录生成连续段分组标识(date_group)
SELECT
user_id,
dt,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY dt) AS num,
DATE_SUB(
dt,
INTERVAL (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY dt)) DAY
) AS date_group -- 核心:生成连续登录段的统一分组标识
FROM (
-- 子查询a:预处理数据,获取用户每日唯一登录记录(去重,按日聚合)
SELECT
user_id,
DATE(time) AS dt, -- 提取登录日期,忽略时分秒
COUNT(1) AS cnt -- 每日登录次数(无实际业务意义,仅用于分组去重)
FROM user_behavior_v2
GROUP BY user_id, DATE(time)
) AS a
) AS b
GROUP BY user_id, date_group -- 按「用户+连续段分组」统计单个段的天数
) AS c
GROUP BY user_id; -- 按用户聚合,得到最长连续天数和总连续天数

3.拆分并统计标签字段

适用于多标签精细化分析。 社交平台 拆分用户兴趣标签,实现个性化推荐。

1
2
# 实现思路:用split函数按指定分隔符将标签字符串拆为数组。用explode函数将数据展开为独立行。

4. 每天浏览量的TOP

思路:加上row_number 之后筛选top3。

业务意义:热门商品的趋势洞察核心指标。

5. 每日累计充值金额

注意区分:不是总体累计金额

1

6. 用户首购信息

1
2
3
4
5
6
7
with firstData AS (
SELECT user_id, item_id, time,
FIRST_VALUE(item_id) OVER(PARTITION BY user_id ORDER by time) AS firstItem,
FIRST_VALUE(time) OVER(PARTITION BY user_id ORDER by time) AS firstTime
FROM user_behavior_v2
)
SELECT DISTINCT user_id,firstItem,firstTime FROM firstData ORDER BY user_id;

7.每月活跃用户(MAU)

1
2
3
4
5
6
SELECT 
DATE_FORMAT(access_time, 'yyyy-MM') AS month,
COUNT(DISTINCT usere_id) AS mau
FROM
user_access_log
GROUP BY month

8. 会话切割

  1. 先给表添加 上一次请求时间。
  2. 然后 diff 上一次请求时间 和当前时间,大于指定值 则为新session
  3. 累加 生成会话ID(sessionID); 或者可统计 用户会话的次数
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
WITH userAccessRank AS (
select user_id,time,LAG(time,1, NULL) OVER(PARTITION BY user_id ORDER BY time) as last_time from user_behavior_v2
),
SessionFlag AS (
select user_id, time, CASE WHEN last_time IS NULL THEN 1 WHEN TIMESTAMPDIFF(HOUR, last_time, time) > 3 THEN 1 ELSE 0 END AS isNewSession
FROM userAccessRank
)
SELECT user_id,time,sum(isNewSession) OVER (PARTITION BY user_id ORDER BY time) as sessionId
FROM SessionFlag
ORDER BY user_id desc, time limit 20;

## 统计用户会话数
WITH userAccessRank AS (
select user_id,time,LAG(time,1, NULL) OVER(PARTITION BY user_id ORDER BY time) as last_time from user_behavior_v2
),
SessionFlag AS (
select user_id, time, CASE WHEN last_time IS NULL THEN 1 WHEN TIMESTAMPDIFF(HOUR, last_time, time) > 3 THEN 1 ELSE 0 END AS isNewSession
FROM userAccessRank
),
UserSession AS (
SELECT
user_id,
time,
SUM(isNewSession) OVER (PARTITION BY user_id ORDER BY time) AS sessionId
FROM SessionFlag
)
SELECT user_id, COUNT(DISTINCT sessionId), MAX(sessionId)
FROM UserSession
GROUP BY user_id
ORDER BY user_id desc limit 20;

9.JSON字段解析

json_tuple()
get_json_object()

10.半连接查询

左表数据筛选、关联效率优化。
相比于 INNER JOIN(可能因右表多条匹配导致左表重复)

Hive中常用 IN 子查询 或 EXISTS 子查询实现 半连接。避免使用 INNER JOIN 导致的左表重复。

11.金额区间分桶

核心是 CASE WHEN

1
2
3
SELECT CASE WHEN item_category = '7344' THEN 'goodsA' WHEN item_category > '10000' THEN 'goodsB' WHEN item_category <10000 THEN 'goodsC'  END as  goodsCategory, 
count(1)
FROM user_behavior_v2 GROUP BY goodsCategory limit 20 ;

12.用户留存率计算

13.日活跃用户增长率

14.交叉标签共线

15.用户分群(K-means前处理)

16.多级分类销量汇总

17.事件类型分布占比

18.环比和同比

19.设备多维度统计