高频SQL题

1.每日UV

业务意义:衡量平台用户规模、吸引力的核心指标。 核心价值:1. 追踪用户增长趋势,判断平台用户基数变化,2.评估营销活动帮效果。3.为商业推广、运营策略调整提供数据支撑。

2.最长连续登录次数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
SELECT
user_id,
MAX(date_seg) AS max_continuous_days, -- 别名优化:最长单次连续登录天数
SUM(date_seg) AS total_continuous_days -- 别名优化:所有连续段总连续登录天数
FROM (
-- 子查询c:统计每个用户每个连续登录段的天数(单个连续段的长度)
SELECT
user_id,
COUNT(date_group) AS date_seg -- date_group相同为同一个连续段,计数即为该段连续天数
FROM (
-- 子查询b:为每个用户的每日登录记录生成连续段分组标识(date_group)
SELECT
user_id,
dt,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY dt) AS num,
DATE_SUB(
dt,
INTERVAL (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY dt)) DAY
) AS date_group -- 核心:生成连续登录段的统一分组标识
FROM (
-- 子查询a:预处理数据,获取用户每日唯一登录记录(去重,按日聚合)
SELECT
user_id,
DATE(time) AS dt, -- 提取登录日期,忽略时分秒
COUNT(1) AS cnt -- 每日登录次数(无实际业务意义,仅用于分组去重)
FROM user_behavior_v2
GROUP BY user_id, DATE(time)
) AS a
) AS b
GROUP BY user_id, date_group -- 按「用户+连续段分组」统计单个段的天数
) AS c
GROUP BY user_id; -- 按用户聚合,得到最长连续天数和总连续天数

3.拆分并统计标签字段

适用于多标签精细化分析。 社交平台 拆分用户兴趣标签,实现个性化推荐。

1
2
# 实现思路:用split函数按指定分隔符将标签字符串拆为数组。用explode函数将数据展开为独立行。

4. 每天浏览量的TOP

思路:加上row_number 之后筛选top3。

业务意义:热门商品的趋势洞察核心指标。

5. 每日累计充值金额

注意区分:不是总体累计金额

1

6. 用户首购信息

1
2
3
4
5
6
7
with firstData AS (
SELECT user_id, item_id, time,
FIRST_VALUE(item_id) OVER(PARTITION BY user_id ORDER by time) AS firstItem,
FIRST_VALUE(time) OVER(PARTITION BY user_id ORDER by time) AS firstTime
FROM user_behavior_v2
)
SELECT DISTINCT user_id,firstItem,firstTime FROM firstData ORDER BY user_id;

7.每月活跃用户(MAU)

1
2
3
4
5
6
SELECT 
DATE_FORMAT(access_time, 'yyyy-MM') AS month,
COUNT(DISTINCT usere_id) AS mau
FROM
user_access_log
GROUP BY month

8. 会话切割

  1. 先给表添加 上一次请求时间。
  2. 然后 diff 上一次请求时间 和当前时间,大于指定值 则为新session
  3. 累加 生成会话ID(sessionID); 或者可统计 用户会话的次数
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
WITH userAccessRank AS (
select user_id,time,LAG(time,1, NULL) OVER(PARTITION BY user_id ORDER BY time) as last_time from user_behavior_v2
),
SessionFlag AS (
select user_id, time, CASE WHEN last_time IS NULL THEN 1 WHEN TIMESTAMPDIFF(HOUR, last_time, time) > 3 THEN 1 ELSE 0 END AS isNewSession
FROM userAccessRank
)
SELECT user_id,time,sum(isNewSession) OVER (PARTITION BY user_id ORDER BY time) as sessionId
FROM SessionFlag
ORDER BY user_id desc, time limit 20;

## 统计用户会话数
WITH userAccessRank AS (
select user_id,time,LAG(time,1, NULL) OVER(PARTITION BY user_id ORDER BY time) as last_time from user_behavior_v2
),
SessionFlag AS (
select user_id, time, CASE WHEN last_time IS NULL THEN 1 WHEN TIMESTAMPDIFF(HOUR, last_time, time) > 3 THEN 1 ELSE 0 END AS isNewSession
FROM userAccessRank
),
UserSession AS (
SELECT
user_id,
time,
SUM(isNewSession) OVER (PARTITION BY user_id ORDER BY time) AS sessionId
FROM SessionFlag
)
SELECT user_id, COUNT(DISTINCT sessionId), MAX(sessionId)
FROM UserSession
GROUP BY user_id
ORDER BY user_id desc limit 20;

9.JSON字段解析

json_tuple()
get_json_object()

10.半连接查询

左表数据筛选、关联效率优化。
相比于 INNER JOIN(可能因右表多条匹配导致左表重复)

Hive中常用 IN 子查询 或 EXISTS 子查询实现 半连接。避免使用 INNER JOIN 导致的左表重复。

11.金额区间分桶

核心是 CASE WHEN

1
2
3
SELECT CASE WHEN item_category = '7344' THEN 'goodsA' WHEN item_category > '10000' THEN 'goodsB' WHEN item_category <10000 THEN 'goodsC'  END as  goodsCategory, 
count(1)
FROM user_behavior_v2 GROUP BY goodsCategory limit 20 ;

12.用户留存率计算

13.日活跃用户增长率

14.交叉标签共线

15.用户分群(K-means前处理)

16.多级分类销量汇总

17.事件类型分布占比

18.环比和同比

19.设备多维度统计