Compression in Python (.tar.bz2, .tar.xz)

news 2025/10/17 0:06:46

I. Performance comparison of common compression algorithms

1. Comparison of mainstream compression algorithms

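Algorithm	Compression ratio	Speed	Memory usage	Best use case
DEFLATE (gzip)	Medium	★★★★	Low	General-purpose data, network transfer
bzip2	★★★	★★	Medium	Long-term storage of text data
LZMA	★★★★	★	High	Cold-data archiving
Zstandard	★★	★★★★	Medium	Real-time data processing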

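As the table shows, bz2 strikes a good balance between compression ratio and speed, which makes it well suited to storing text data where a high compression ratio matters.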
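The ratings in the table are qualitative, so it can help to time the codecs on your own data. The sketch below is only an illustration: the log-like sample is a made-up assumption, and Zstandard is left out because it is not available in the standard library of every Python version. Sizes and timings will differ by machine and input.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical sample: highly repetitive log-like text (an assumption, not real data).
sample = ("2025-10-17 00:06:46 INFO request handled in 12 ms\n" * 20000).encode("utf-8")

codecs = {
    "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    start = time.perf_counter()
    packed = compress(sample)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == sample  # sanity-check the round trip
    results[name] = (len(packed), elapsed)
    print(f"{name:15s} {len(packed):>8d} bytes  {elapsed * 1000:.1f} ms")
```

Each codec runs with its default settings here, so this compares defaults rather than the best each algorithm can do.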

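II. bz2 compression in practice

Python's bz2 module is the standard-library interface to the bzip2 algorithm. It needs no extra installation; just import it: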

import bz2

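1. Compressing and decompressing files

File-level operations are the most commonly used feature of the bz2 module, and bz2.open() works much like Python's built-in open() function: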

import bz2

# Compress a string
data = "This is a large amount of text to compress. " * 10000
compressed = bz2.compress(data.encode('utf-8'))
print(f"Original size: {len(data.encode('utf-8'))} bytes")
print(f"Compressed size: {len(compressed)} bytes")

# Decompress
decompressed = bz2.decompress(compressed).decode('utf-8')
print("Decompressed OK:", decompressed == data)

# Write a compressed file
with bz2.open('data.txt.bz2', 'wb', compresslevel=9) as f:
    f.write(b"Large content needs compressing...")

# Read the compressed file back
with bz2.open('data.txt.bz2', 'rb') as f:
    content = f.read()
print(content.decode())

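The compresslevel parameter (1-9) controls compression strength: level 1 is the fastest but compresses the least, level 9 is the slowest but compresses the most. The default is 9.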
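To see what compresslevel does in practice, the following sketch compresses one payload at three levels. The payload is a hypothetical stream of pseudo-random words, deliberately a couple of megabytes long: in bzip2 the level selects the block size (100 kB times the level), so inputs smaller than one block tend to compress to the same size at every level.

```python
import bz2
import random

# Hypothetical payload (an assumption for illustration): pseudo-random words,
# large enough to span several bzip2 blocks.
random.seed(0)
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
payload = " ".join(random.choice(words) for _ in range(400_000)).encode("utf-8")

sizes = {}
for level in (1, 5, 9):
    sizes[level] = len(bz2.compress(payload, compresslevel=level))
    print(f"compresslevel={level}: {sizes[level]} bytes")

print(f"original: {len(payload)} bytes")
```

How much the levels differ depends heavily on the data, so it is worth measuring before paying for level 9 on every file.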

2. Compressing and decompressing data in memory

Smaller data can be compressed and decompressed directly in memory:

import bz2

original_data = b"Repeated patterns compress well with bzip2..."

# Compress the data (very short inputs can come out larger than the original)
compressed = bz2.compress(original_data, compresslevel=7)
print(f"Compressed size: {len(compressed)} bytes, compressed/original: {len(compressed)/len(original_data):.1%}")

# Decompress the data
restored = bz2.decompress(compressed)
assert original_data == restored

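3. Processing large files incrementally

For very large files that should not be loaded into memory in one go, use incremental compression and decompression: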

# Incremental compression
# The level is passed positionally; the C implementation may reject it as a keyword.
compressor = bz2.BZ2Compressor(5)
chunks = [b'Chunk one', b'Chunk two', b'Chunk three']
compressed_chunks = []
for chunk in chunks:
    compressed_chunks.append(compressor.compress(chunk))
compressed_chunks.append(compressor.flush())  # finish the stream

# Incremental decompression
decompressor = bz2.BZ2Decompressor()
result = b""
for chunk in compressed_chunks:
    result += decompressor.decompress(chunk)

This approach is particularly suitable for network streams or data generated in real time.
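For stream-like sources, the same incremental objects can be wrapped in generators so that no list of chunks has to be built. This is a minimal sketch: compress_stream and decompress_stream are hypothetical helper names, not part of the bz2 module.

```python
import bz2

def compress_stream(chunks, compresslevel=9):
    """Yield compressed pieces for an iterable of bytes chunks."""
    compressor = bz2.BZ2Compressor(compresslevel)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor buffers input and often returns b""
            yield piece
    yield compressor.flush()  # emit whatever is still buffered

def decompress_stream(pieces):
    """Yield decompressed data for an iterable of compressed pieces."""
    decompressor = bz2.BZ2Decompressor()
    for piece in pieces:
        data = decompressor.decompress(piece)
        if data:
            yield data
    if not decompressor.eof:
        raise ValueError("compressed stream ended before the end-of-stream marker")

source = [b"Chunk one", b"Chunk two", b"Chunk three"]
restored = b"".join(decompress_stream(compress_stream(source)))
print(restored)
```

Checking decompressor.eof at the end is a cheap way to detect a truncated stream, for example when a network connection drops mid-transfer.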

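4. Data archiving and long-term storage

In data analysis, raw data often contains a lot of repeated information. Compressing it with bz2 can significantly reduce storage costs: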

import pandas as pd
import bz2

# Save a compressed CSV
df = pd.read_csv('large_dataset.csv')
with bz2.open('large_dataset.csv.bz2', 'wt') as f:
    df.to_csv(f, index=False)

# Read it back from the compressed file
with bz2.open('large_dataset.csv.bz2', 'rt') as f:
    restored_df = pd.read_csv(f)

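For text-format datasets, bz2 reportedly saves about 15-25% more space than gzip in typical cases, although the exact figure depends on the data.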

5. Processing log files

Server logs are usually large but highly compressible, which makes them a good fit for bz2:

import bz2

def compress_logs(log_path, output_path):
    try:
        with open(log_path, 'rb') as f_in:
            with bz2.open(output_path, 'wb') as f_out:
                while True:
                    chunk = f_in.read(1024 * 1024)  # read 1 MB at a time
                    if not chunk:
                        break
                    f_out.write(chunk)
        print(f"Log compressed and saved to: {output_path}")
    except FileNotFoundError:
        print(f"Error: file not found - {log_path}")
    except PermissionError:
        print(f"Error: permission denied reading {log_path} or writing {output_path}")
    except Exception as e:
        print(f"An error occurred during compression: {e}")

6. Optimizing network data transfer

In a microservice architecture, compressing API responses can substantially reduce transfer time:

from flask import Flask, Response
import bz2
import json

app = Flask(__name__)

@app.route('/large-data')
def get_large_data():
    data = generate_large_json()  # placeholder: produces a large amount of data
    compressed = bz2.compress(json.dumps(data).encode())
    return Response(compressed, headers={'Content-Encoding': 'bzip2'})

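Line-by-line explanation:

@app.route('/large-data')
Defines a route: when a user requests /large-data, get_large_data() is called.
data = generate_large_json()
Calls an assumed function that produces a large Python data structure (such as a list or dict) to return to the client.
⚠️ Note: this function is not defined in the code; it is only a placeholder.
compressed = bz2.compress(json.dumps(data).encode())
- json.dumps(data): converts data to a JSON string.
- .encode(): encodes the string to bytes, because bz2.compress requires bytes input.
- bz2.compress(...): compresses those bytes with the bzip2 algorithm.
- The result is the compressed binary data (a bytes object).
return Response(compressed, headers={'Content-Encoding': 'bzip2'})
- Returns a Response object containing the compressed data.
- Sets the response header Content-Encoding: bzip2 to tell the client that the body is bzip2-compressed. Browsers generally do not decode bzip2 the way they decode gzip, deflate, or br, so this only works when the client decompresses the body itself.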
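On the receiving side, the client has to reverse the two steps in the opposite order. The sketch below simulates both ends without a running server; encode_payload, decode_payload, and the records list are hypothetical stand-ins (records plays the role of the generate_large_json() placeholder).

```python
import bz2
import json

def encode_payload(data):
    """Mirror the server side: JSON-encode, then bzip2-compress."""
    return bz2.compress(json.dumps(data).encode("utf-8"))

def decode_payload(body):
    """Client side: remove the bzip2 layer, then parse the JSON."""
    return json.loads(bz2.decompress(body).decode("utf-8"))

# Hypothetical stand-in for the data returned by generate_large_json()
records = [{"id": i, "status": "ok"} for i in range(1000)]

body = encode_payload(records)
print(len(json.dumps(records)), "->", len(body), "bytes")
print("round trip OK:", decode_payload(body) == records)
```

In a real client, decode_payload would be applied to the raw response bytes, assuming the HTTP library leaves the unrecognized Content-Encoding untouched.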

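7. Handling large files efficiently (streaming compression to save memory)

This suits large files such as logs, JSON, and CSV, and avoids loading the whole file into memory at once.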

import bz2

def compress_large_file(input_path, output_path, chunk_size=1024*1024):  # 1 MB per chunk
    """Stream-compress a large file to avoid running out of memory."""
    with open(input_path, 'rb') as f_in:
        with bz2.open(output_path, 'wb') as f_out:
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)
    print(f"✅ Compression finished: {input_path} -> {output_path}")

def decompress_large_file(compressed_path, output_path):
    """Stream-decompress a bzip2 file."""
    with bz2.open(compressed_path, 'rb') as f_in:
        with open(output_path, 'wb') as f_out:
            while True:
                chunk = f_in.read(1024 * 1024)
                if not chunk:
                    break
                f_out.write(chunk)
    print(f"✅ Decompression finished: {compressed_path} -> {output_path}")

# Usage example
compress_large_file('large_log.txt', 'large_log.txt.bz2')
decompress_large_file('large_log.txt.bz2', 'large_log_decompressed.txt')
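The read/write loop above can also be delegated to shutil.copyfileobj, which copies between two file objects in fixed-size chunks. The following is a possible shorter variant, not a replacement the original prescribes; the function names and the temporary sample file are assumptions made for the demonstration.

```python
import bz2
import os
import shutil
import tempfile

def compress_with_copyfileobj(input_path, output_path, chunk_size=1024 * 1024):
    """Stream input_path into a .bz2 file without loading it into memory."""
    with open(input_path, "rb") as f_in, bz2.open(output_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)

def decompress_with_copyfileobj(compressed_path, output_path, chunk_size=1024 * 1024):
    """Stream a .bz2 file back out to a plain file."""
    with bz2.open(compressed_path, "rb") as f_in, open(output_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)

def roundtrip_demo():
    """Compress and restore a throwaway file; return True if the bytes match."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.log")
        packed = os.path.join(tmp, "sample.log.bz2")
        restored = os.path.join(tmp, "sample_restored.log")
        with open(src, "wb") as f:
            f.write(b"simulated log line\n" * 50_000)
        compress_with_copyfileobj(src, packed)
        decompress_with_copyfileobj(packed, restored)
        with open(src, "rb") as a, open(restored, "rb") as b:
            return a.read() == b.read()

print("round trip OK:", roundtrip_demo())
```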

8. Compressing multiple files in parallel with multiple processes (faster batch processing)

Use concurrent.futures.ProcessPoolExecutor to compress several files in parallel:

import bz2
from concurrent.futures import ProcessPoolExecutor, as_completed
import os

def compress_file(args):
    """Compress a single file; called by the worker processes.

    args: (input_path, output_path)
    """
    input_path, output_path = args
    try:
        with open(input_path, 'rb') as f_in:
            with bz2.open(output_path, 'wb', compresslevel=6) as f_out:
                while chunk := f_in.read(1024*1024):
                    f_out.write(chunk)
        print(f"📦 Compressed: {input_path} -> {output_path}")
        return input_path, True
    except Exception as e:
        print(f"❌ Compression failed {input_path}: {e}")
        return input_path, False

def parallel_compress(files, max_workers=None):
    """Compress multiple files in parallel.

    files: list of (input_path, output_path) tuples
    max_workers: maximum number of processes, defaults to the CPU core count
    """
    if max_workers is None:
        max_workers = os.cpu_count()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(compress_file, file) for file in files]
        results = {}
        for future in as_completed(futures):
            filename, success = future.result()
            results[filename] = success
    return results

# Usage example: batch-compress several log files
if __name__ == '__main__':
    files_to_compress = [
        ('log1.txt', 'log1.txt.bz2'),
        ('log2.txt', 'log2.txt.bz2'),
        ('log3.txt', 'log3.txt.bz2'),
    ]
    # Make sure the input files exist (for the example only)
    for i in range(1, 4):
        with open(f'log{i}.txt', 'w') as f:
            f.write(f"This is simulated content for log file {i}.\n" * 10000)
    # Start the parallel compression
    result = parallel_compress(files_to_compress)
    print("Final result:", result)
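The title mentions .tar.bz2 and .tar.xz, which bundle several files into one archive instead of compressing each file separately. A minimal sketch with the standard-library tarfile module might look like this; make_archive, list_archive, and the temporary log files are hypothetical names and data chosen for the example.

```python
import os
import tarfile
import tempfile

def make_archive(paths, archive_path, mode):
    """Pack files into a compressed tar archive; mode is 'w:bz2' or 'w:xz'."""
    with tarfile.open(archive_path, mode) as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))

def list_archive(archive_path):
    """Return member names; 'r:*' lets tarfile detect the compression."""
    with tarfile.open(archive_path, "r:*") as tar:
        return tar.getnames()

def demo():
    """Build both archive types from three throwaway files and list their members."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(1, 4):
            path = os.path.join(tmp, f"log{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(f"Simulated content of log file {i}.\n" * 1000)
            paths.append(path)
        names = {}
        for suffix, mode in ((".tar.bz2", "w:bz2"), (".tar.xz", "w:xz")):
            archive = os.path.join(tmp, "logs" + suffix)
            make_archive(paths, archive, mode)
            names[suffix] = list_archive(archive)
        return names

print(demo())
```

The "w:xz" mode uses LZMA, the algorithm the comparison table lists for cold-data archiving, while "w:bz2" uses the same bzip2 compression as the bz2 module.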