CF700D Huffman Coding on Segment

时间：2021-02-02 12:20:47 阅读：32 评论：0 收藏：0 [点我收藏+]

CF700D Huffman Coding on Segment

题目大意

题目链接

考虑对一个字符串（本题里用数字序列表示）进行二进制编码。使得：

字符串里出现过的每个字符，都对应一个二进制编码。
不存在两个字符 \(i, j\)，使得 \(i\) 的编码是 \(j\) 的编码的前缀。

一种编码方案的“长度”，是指把字符串里每个字符对应的编码顺次连接起来，得到的二进制串的长度。

给定一个长度为 \(n\) 的序列 \(a\)。\(q\) 次询问。每次询问给出区间 \(l_i, r_i\)，求给字符串 \(a_{l_i}, a_{l_i + 1},\dots,a_{r_i}\) 编码的最小长度。

数据范围：\(1\leq n, q, a_i\leq 10^5\)。

本题题解

一种编码方案，可以看做一棵二叉树。从根出发，向左走一步表示一个 \(0\)，向右走一步表示一个 \(1\)。每个字符，都是二叉树上的一个叶子。设第 \(i\) 种字符的出现次数为 \(c_i\)，在二叉树上的深度为 \(d_i\)，则该编码方案的长度就是：\(\sum_{i} c_i\cdot d_i\)。我们要构造出一棵二叉树，来最小化这个值。

这是一个经典的问题。可以用贪心法求解。

设有 \(k\) 种字符。初始时，只有 \(k\) 个节点，它们之间还没有连边。可以看做是 \(k\) 棵树，每个节点都是根。每个点的点权是 \(c_i\)。每次选择两个点权最小的根节点，将它们合并。即：新建一个节点，作为它们的父亲，新节点的点权是原来两个根节点的点权之和。同时因为两个节点高度都增加了 \(1\)，所以答案要加上这两个节点的点权和。可以参考如下的伪代码：

\[\begin{array}{l} \textbf{def: } \mathrm{calc}(c_1, c_2, \dots ,c_k) \\qquad S \leftarrow \{c_i\}\\qquad \mathrm{res}\leftarrow 0\\qquad \textbf{while } (|S| > 1) \\qquad \qquad a \leftarrow \min\{S\} \\qquad \qquad S \leftarrow S \setminus a\\qquad \qquad b \leftarrow \min\{S\} \\qquad \qquad S \leftarrow S \setminus b\\qquad \qquad \mathrm{res}\leftarrow \mathrm{res} + a + b\\qquad \qquad S \leftarrow S \cup \{a + b\}\\qquad \textbf{endwhile.} \\textbf{enddef.} \end{array} \]

可以用 \(\texttt{std::priority_queue}\) 实现上述过程。时间复杂度 \(\mathcal{O}(k \log k)\)。

本题里，我们要处理 \(q\) 次询问。朴素做法的时间复杂度是 \(\mathcal{O}(qn\log n)\)，无法通过。

可以用莫队算法来维护每个数值的出现次数，这部分的时间复杂度是 \(\mathcal{O}(q\sqrt{n})\)。

然后如何计算答案呢？因为没有直观的做法，所以考虑根号分治：把两个暴力拼起来！设置一个阈值 \(B\)。

对于区间里，出现次数 \(\geq B\) 的数，这样的数最多只有 \(\frac{n}{B}\) 个，可以直接用 \(\texttt{std::priority_queue}\) 实现上述的贪心做法，时间复杂度 \(\mathcal{O}(\frac{n}{B}\log \frac{n}{B})\)。
出现次数 \(< B\) 的数，它们的数量可能很多。所以我们不枚举这些数。而是直接枚举它们的出现次数，即 \(1\dots B - 1\)。可以提前用一个数组存好，每种出现次数对应了多少个数值。从 \(1\) 到 \(B - 1\) 扫描所有出现次数，模拟上述贪心做法的过程：设当前扫描到的出现次数为 \(i\)，对应的数值有 \(t_i\) 个，那么将它们两两合并为 \(2i\)，即：\(t_{2i}\leftarrow t_{2i} + \frac{t_i}{2}\)。特别地，如果 \(2i\geq B\)，则直接暴力将这 \(\frac{t_i}{2}\) 个数加入优先队列，和后面（出现次数 \(\geq B\)）的数放在一起贪心。另外，\(t_i\) 是奇数时，需要注意一些细节。

下面分析第二种情况的时间复杂度，首先扫描一遍肯定是 \(\mathcal{O}(B)\) 的。重点在于 \(2i\geq B\) 时，暴力将 \(\frac{t_i}{2}\) 个数加入优先队列，总共会加几次。我们每次操作会令 \(t_{2i}\leftarrow t_{2i} + \frac{t_i}{2}\)。发现不管操作多少次，\(\sum i\cdot t_i\) 始终是不变的。因为初始时 \(\sum i\cdot t_i\leq n\)，所以最终的 \(\sum_{i\geq B} i\cdot t_i\) 还是 \(\leq n\) 的。我们加入优先队列的元素数量，就是此时的 \(\sum_{i\geq B} t_i\)，它是 \(\mathcal{O}(\frac{n}{B})\) 级别的。

于是我们能够在 \(\mathcal{O}(B + \frac{n}{B}\log \frac{n}{B})\) 的时间复杂度内回答单次询问。取 \(B = \sqrt{n\log n}\)。总时间复杂度 \(\mathcal{O}(q\sqrt{n \log n})\)。（大概）。

参考代码

实际提交时建议使用读入、输出优化，详见本博客公告。

// problem: CF700D
#include <bits/stdc++.h>
using namespace std;

#define mk make_pair
#define fi first
#define se second
#define SZ(x) ((int)(x).size())

typedef unsigned int uint;
typedef long long ll;
typedef unsigned long long ull;
typedef pair<int, int> pii;

template<typename T> inline void ckmax(T& x, T y) { x = (y > x ? y : x); }
template<typename T> inline void ckmin(T& x, T y) { x = (y < x ? y : x); }

const int MAXN = 1e5, MAXA = 1e5;
const int BLOCK_SIZE = 320, BLOCK_NUM = MAXN / BLOCK_SIZE + 5;
const int THR = 600;

int n, a[MAXN + 5];
int bel[MAXN + 5], st[BLOCK_NUM + 5], ed[BLOCK_NUM + 5], cnt_block;

int q;
ll ans[MAXN + 5];
struct Query_t {
	int l, r, id;
} qs[MAXN + 5];
bool cmp(Query_t lhs, Query_t rhs) {
	if (bel[lhs.l] == bel[rhs.l]) {
		return lhs.r < rhs.r;
	}
	return lhs.l < rhs.l;
}

int buc[MAXN + 5]; // 每种数值的出现次数
int buc2[MAXN + 5]; // 每种"出现次数"的出现次数
int idx[MAXN + 5], big[MAXN + 5], cnt_big;

void ins(int x) {
	// cerr << "ins " << x << endl;
	buc2[buc[x]]--;
	buc[x]++;
	buc2[buc[x]]++;
	if (buc[x] == THR) { // 刚好达到
		big[++cnt_big] = x;
		idx[x] = cnt_big;
	}
}
void del(int x) {
	// cerr << "del " << x << endl;
	buc2[buc[x]]--;
	buc[x]--;
	buc2[buc[x]]++;
	if (buc[x] == THR - 1) {
		if (idx[x] == cnt_big) {
			idx[x] = 0;
			--cnt_big;
			return;
		}
		idx[big[cnt_big]] = idx[x];
		swap(big[idx[x]], big[cnt_big]);
		idx[x] = 0;
		--cnt_big;
	}
}

int curl, curr;
void movel(int t) {
	if (t == -1) {
		--curl;
		ins(a[curl]);
	} else {
		del(a[curl]);
		++curl;
	}
}
void mover(int t) {
	if (t == 1) {
		++curr;
		ins(a[curr]);
	} else {
		del(a[curr]);
		--curr;
	}
}

int tmp_buc2[MAXN + 5];
ll calc_ans() {
	priority_queue<int> que;
	for (int i = 1; i < THR; ++i) { // 出现次数少的数值
		tmp_buc2[i] = buc2[i];
	}
	
	ll ans = 0;
	int lst = 0;
	for (int i = 1; i < THR; ++i) {
		if (!tmp_buc2[i])
			continue;
		
		if (lst) {
			tmp_buc2[i]--;
			ans += i + lst;
			if (lst + i < THR) {
				tmp_buc2[lst + i]++;
			} else {
				que.push(- lst - i);
			}
			
			lst = 0;
		}
		if (tmp_buc2[i] % 2 == 1) {
			tmp_buc2[i]--;
			lst = i;
		}
		
		ans += (ll)i * tmp_buc2[i];
		if (i * 2 < THR) {
			tmp_buc2[i * 2] += tmp_buc2[i] / 2;
		} else {
			for (int j = 1; j <= tmp_buc2[i] / 2; ++j) {
				que.push(- i * 2);
			}
		}
	}
	if (lst) {
		que.push(-lst);
	}
	for (int i = 1; i <= cnt_big; ++i) {
		que.push(- buc[big[i]]);
	}
	while (SZ(que) > 1) {
		int x = -que.top();
		que.pop();
		int y = -que.top();
		que.pop();
		ans += x + y;
		que.push(- x - y);
	}
	return ans;
}

int main() {
	cin >> n;
	for (int i = 1; i <= n; ++i) {
		cin >> a[i];
	}
	for (int i = 1; i <= n; i += BLOCK_SIZE) {
		int j = min(i + BLOCK_SIZE - 1, n);
		++cnt_block;
		st[cnt_block] = i;
		ed[cnt_block] = j;
		for (int k = i; k <= j; ++k) {
			bel[k] = cnt_block;
		}
	}
	
	cin >> q;
	for (int i = 1; i <= q; ++i) {
		cin >> qs[i].l >> qs[i].r;
		qs[i].id = i;
	}
	sort(qs + 1, qs + q + 1, cmp);
	buc2[0] = MAXA;
	
	curr = 0;
	curl = 1;
	for (int i = 1; i <= q; ++i) {
		while (curr < qs[i].r) mover(1);
		while (curl > qs[i].l) movel(-1);
		while (curr > qs[i].r) mover(-1);
		while (curl < qs[i].l) movel(1);
		ans[qs[i].id] = calc_ans();
	}
	for (int i = 1; i <= q; ++i) {
		cout << ans[i] << endl;
	}
	return 0;
}

CF700D Huffman Coding on Segment

原文：https://www.cnblogs.com/dysyn1314/p/14361121.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年09月23日 (328)
2021年09月24日 (313)
2021年09月17日 (191)
2021年09月15日 (369)
2021年09月16日 (411)
2021年09月13日 (439)
2021年09月11日 (398)
2021年09月12日 (393)
2021年09月10日 (160)
2021年09月08日 (222)