Today I wanted to scrape some policies from the Policy Service site (smejs.cn). The link addresses can't be found in the HTML source, so I opened the browser's developer tools and clicked the area in the red box below. The Preview pane shows that the id of each target link is right there, so the full address just needs to be concatenated. The Headers pane shows the API address used to query the backend server. With that address in hand, you can write a Python script (and if you can't, ask GPT to write it for you; it works great).
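Before writing the full script, it can help to probe the endpoint once and confirm the response shape. A minimal sketch, under the assumption that the endpoint tolerates a reduced set of query parameters (the field names data, records, and id match what the full script below relies on):

import requests

api = "https://policy-gateway.smejs.cn/policy/api/policy/getNewPolicyList"
# Assumed minimal parameter set; the full script below sends many more fields
params = {"current": 1, "size": 15, "provinceValue": "江苏省", "isSearch": "N"}

resp = requests.get(api, params=params, verify=False)
resp.raise_for_status()

# Build a detail-page URL by concatenating the record id onto the base path
first = resp.json()["data"]["records"][0]
print("https://policy.smejs.cn/frontend/policy-service/" + first["id"])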
import requests
import pandas as pd
import logging
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Request headers
headers = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Base URLs: the list API and the detail-page prefix
base_url = "https://policy-gateway.smejs.cn/policy/api/policy/getNewPolicyList"
base_policy_url = "https://policy.smejs.cn/frontend/policy-service/"

# Query parameters
params = {
    "orderBy": "",
    "keyWords": "",
    "genreCode": "K,A,S,Z",
    "queryPublishBegin": "",
    "queryPublishEnd": "",
    "queryApplyBegin": "",
    "queryApplyEnd": "",
    "typeCondition": "",
    "publishUnit": "",
    "applyObj": "",
    "meetEnterprise": "",
    "title": "",
    "commissionOfficeIds": "",
    "commissionOfficeSearchIds": "",
    "industry": "",
    "relativePlatform": "",
    "level": "",
    "isSearch": "N",
    "policyType": "",
    "provinceValue": "江苏省",  # Jiangsu Province
    "cityValue": "",
    "regionValue": "",
    "current": 1,
    "size": 15,
    "total": 23960,
    "page": 0
}

# Total number of records and records per page
total_policies = 23960
page_size = 15
total_pages = (total_policies // page_size) + 1

# Collect all policy rows here
all_policies = []

# Configure the retry strategy
retry_strategy = Retry(
    total=5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

# Iterate over every page
for page in range(total_pages):
    params["current"] = page + 1
    try:
        response = http.get(base_url, headers=headers, params=params, verify=False)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch data for page {page + 1}: {e}")
        continue

    data = response.json()
    if "records" not in data["data"]:
        logging.error(f"No records found for page {page + 1}")
        continue

    records = data["data"]["records"]
    for record in records:
        policy_id = record.get("id")
        level_value = record.get("levelValue")
        title = record.get("title")
        type_value = record.get("typeValue")
        commission_office_names = record.get("commissionOfficeNames")
        publish_time = record.get("publishTime")
        valid_date_end = record.get("validDateEnd")
        policy_url = base_policy_url + policy_id
        all_policies.append({
            "ID": policy_id,
            "URL": policy_url,
            "Level Value": level_value,
            "Title": title,
            "Type Value": type_value,
            "Commission Office Names": commission_office_names,
            "Publish Time": publish_time,
            "Valid Date End": valid_date_end
        })

    logging.info(f"Fetched data for page {page + 1}")
    time.sleep(1)  # throttle requests so we don't hit the server too fast

# Convert to a DataFrame
df = pd.DataFrame(all_policies)

# Save to Excel
df.to_excel("policies.xlsx", index=False)
logging.info("Data saved to policies.xlsx")
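One side note: since the requests are sent with verify=False, urllib3 emits an InsecureRequestWarning on every call. If you have decided to skip certificate verification anyway, you can silence the noise near the top of the script:

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)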
Then just run it and wait for the scrape to finish. It could also be done with multiple threads later; I haven't tried that yet, and I don't know whether the site has any anti-scraping mechanism...
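If you do want to try multithreading, here is a rough sketch using ThreadPoolExecutor. It is untested against this site; it assumes the base_url, headers, params, total_pages, and http session defined above, and fetch_page is a hypothetical helper that just wraps the per-page request from the loop. Note that requests.Session is not guaranteed to be thread-safe, so with more workers it may be safer to give each thread its own session; also keep max_workers small until you know whether the site rate-limits aggressive clients.

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_page(page):
    # Give each call its own copy of params so concurrent pages don't clobber "current"
    page_params = dict(params, current=page + 1)
    try:
        resp = http.get(base_url, headers=headers, params=page_params, verify=False)
        resp.raise_for_status()
        return resp.json()["data"].get("records", [])
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to fetch page {page + 1}: {e}")
        return []

all_records = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_page, p) for p in range(total_pages)]
    for future in as_completed(futures):
        all_records.extend(future.result())

logging.info(f"Fetched {len(all_records)} records across {total_pages} pages")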