Simulating a Crawler for cnblogs (博客园)


2024-10-02 06:39:39 · 211 views

    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    /*
     * How it works:
     * 1. Use a packet-capture tool to find the request URL:
     *    http://www.cnblogs.com/liuxiaoji/p/4689119.html
     * 2. The capture shows that this is a GET request.
     * 3. Fetch the page content with an HTTP request.
     * 4. HttpHelper is a helper class; contact the author to purchase it.
     */
    HttpHelper http = new HttpHelper();
    string htmlText = http.HttpGet("http://www.cnblogs.com/liuxiaoji/p/4689119.html", string.Empty, Encoding.UTF8, false, false, 5000);

    // Regex that captures the href value of each <link> tag (the CSS paths)
    Regex linkCss = new Regex(@"<link\b[^<>]*?\bhref[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<url>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);

    // Find every match in the page
    MatchCollection matches = linkCss.Matches(htmlText);

    // Walk the matches and turn site-relative CSS paths into absolute URLs
    foreach (Match match in matches)
    {
        var item = match.Groups["url"].Value;
        // Skip empty hrefs: string.Replace throws on an empty search string
        if (item.Length > 0 && !item.Contains("http://www.cnblogs.com"))
        {
            htmlText = htmlText.Replace(item, item.Contains("/skins") ? $"http://www.cnblogs.com{item}" : $"http://www.cnblogs.com/skins{item}");
        }
    }

    // Final result
    var result = htmlText;

    // Save the page to a file
    using (FileStream fs = new FileStream("E:\\liuxiaoji.html", FileMode.Create))
    {
        var data = Encoding.UTF8.GetBytes(result);
        fs.Write(data, 0, data.Length);
    }
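The loop above prefixes the host (and sometimes /skins) by hand and replaces every occurrence of the matched string in the page. As a possible alternative, and not the original author's method, the sketch below resolves each href against the page URL with the standard System.Uri class and rewrites only the matched attribute. The names LinkRewriter and MakeLinksAbsolute are made up for illustration, and the regex is a simplified variant of the one above. Because it follows ordinary URL resolution, it does not reproduce the /skins prefixing rule.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post or of HttpHelper.
public static class LinkRewriter
{
    // Group 1: everything from "<link" up to the start of the href value.
    // Group 2: the href value itself.
    private static readonly Regex LinkHref = new Regex(
        @"(<link\b[^<>]*?\bhref\s*=\s*[""']?)([^\s""'<>]+)",
        RegexOptions.IgnoreCase);

    // Returns html with every <link href> resolved against pageUrl.
    public static string MakeLinksAbsolute(string html, Uri pageUrl)
    {
        return LinkHref.Replace(html, m =>
        {
            Uri absolute;
            // Hrefs that are already absolute resolve to themselves;
            // anything that cannot be resolved is left untouched.
            if (Uri.TryCreate(pageUrl, m.Groups[2].Value, out absolute))
            {
                return m.Groups[1].Value + absolute.AbsoluteUri;
            }
            return m.Value;
        });
    }
}
```

With this helper, the foreach loop could be replaced by a single call such as htmlText = LinkRewriter.MakeLinksAbsolute(htmlText, new Uri("http://www.cnblogs.com/liuxiaoji/p/4689119.html")), assuming the stylesheets really live at the paths the page declares.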
