首页 > 代码库 > java 去掉html标签(Java中去掉网页HTML标记的方法 )--正则表达式
java 去掉html标签(Java中去掉网页HTML标记的方法 )--正则表达式
参考:
http://www.cnblogs.com/newsouls/p/3995394.html
http://blog.csdn.net/he20101020/article/details/21228311
内容:
package utils; import java.util.regex.Matcher; import java.util.regex.Pattern; /** * 注:\n 回车(\u000a) * \t 水平制表符(\u0009) * \s 空格(\u0008) * \r 换行(\u000d) * Created by Administrator on 2017/7/14. */ public class HtmlUtil { public static void main(String[] args) { String str = "<div class=\"WB_cardwrap S_bg2\">\n" + " <div class=\"search_feed\">\n" + " <div class=\"person_list_feed clearfix\">\n" + " <div class=\"pl_personlist\">\n" + " <div class=\"list_person clearfix\">\n" + " <div class=\"person_pic\">\n" + " <a target=\"_blank\" href=http://www.mamicode.com/"http://weibo.com/114dotcom?refer_flag=1001030201_\" title=\"114导航\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_icon/">\n" + " <img class=\"W_face_radius\" src=http://www.mamicode.com/"http://tva4.sinaimg.cn/crop.176.129.505.505.180/006aCV5fgw1f0gxs4ikuyj30ne0juq47.jpg\" uid=\"5653836249\" height=\"80\" width=\"80\" /></a>\n" + " </div>\n" + " <div class=\"person_detail\">\n" + " <p class=\"person_name\">\n" + " <a class=\"W_texta W_fb\" target=\"_blank\" href=http://www.mamicode.com/"http://weibo.com/114dotcom?refer_flag=1001030201_\" title=\"114导航\" uid=\"5653836249\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_name/">114导航</a>\n" + " <a target=\"_blank\" href=http://www.mamicode.com/"http://verified.weibo.com/verify\" title=\"微博机构认证\" alt=\"微博机构认证\" class=\"W_icon icon_approve_co\"></a>\n" + " </p>\n" + " <p class=\"person_addr\">\n" + " <span class=\"male m_icon\" title=\"男\"></span>\n" + " <span>广东</span>\n" + " <a class=\"W_linkb\" target=\"_blank\" href=http://www.mamicode.com/"http://weibo.com/114dotcom?refer_flag=1001030201_\" class=\"wb_url\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_url/">http://weibo.com/114dotcom</a></p>\n" + " <p class=\"person_card\">\n" + " <em class=\"red\">一一四网络有限公司</em></p>\n" + " <p class=\"person_num\">\n" + " <span>关注\n" + " <a class=\"W_linkb\" href=http://www.mamicode.com/"http://weibo.com/5653836249/follow?refer_flag=1001030201_\" target=\"_blank\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_num/">68</a></span>\n" + " <span>粉丝\n" + " <a class=\"W_linkb\" href=http://www.mamicode.com/"http://weibo.com/5653836249/fans?refer_flag=1001030201_\" target=\"_blank\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_num/">118</a></span>\n" + " <span>微博\n" + " <a class=\"W_linkb\" href=http://www.mamicode.com/"http://weibo.com/5653836249/profile?refer_flag=1001030201_\" target=\"_blank\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_num/">7</a></span>\n" + " </p>\n" + " <div class=\"person_info\">\n" + " <p>简介: 114.com,不一样的导航,能记住吗?追求小清新,简约到极致。推荐给您的不仅是网址,更是给你需要的答案。此外,114.com还提供电话、品牌、名人、价格等各种实用查询。</p>\n" + " </div>\n" + " <p class=\"person_label\">标签:\n" + " <a class=\"W_linkb\" href=http://www.mamicode.com/"&tag=%25E6%2596%25B0%25E9%2597%25BB%25E7%2583%25AD%25E7%2582%25B9&Refer=SUer_tag\" suda-data=http://www.mamicode.com/"key=tblog_search_user&value=http://www.mamicode.com/user_feed_1_label/">新闻热点</a></p>\n" + " </div>\n" + " </div>\n" + " </div>\n" + " </div>\n" + " </div>\n" + "</div>\n" + "<div class=\"WB_cardwrap S_bg2 relative\"></div>\n" + "<!-- 未登录提示 -->\n" + "<div class=\"search_tips clearfix\">\n" + " <p class=\"tips_co\">\n" + " <span class=\"tips_icon icon_warn\"></span>\n" + " <span class=\"tips_txt\">\n" + " <a href=http://www.mamicode.com/"javascript:void(0);\" action-type=\"login\">立即登录</a>查看更多结果。还没有账号?赶紧\n" + " <a href=http://www.mamicode.com/"http://weibo.com/signup/signup.php?lang=zh-cn&entry=weisousuo\" suda-data=http://www.mamicode.com/"key=tblog_search_v4.1&value=http://www.mamicode.com/nologin_reg/" target=\"_blank\">注册微博</a></span>\n" + " </p>\n" + "</div>\n" + "<!-- /未登录提示 -->"; System.out.println(delHTMLTag(str)); } public static String delHTMLTag(String htmlStr){ String regEx_script="<script[^>]*?>[\\s\\S]*?<\\/script>"; //定义script的正则表达式 String regEx_style="<style[^>]*?>[\\s\\S]*?<\\/style>"; //定义style的正则表达式 String regEx_html="<[^>]+>"; //定义HTML标签的正则表达式 String regEx_space = "\\s*|\t|\r|\n";//定义空格回车换行符 Pattern p_script=Pattern.compile(regEx_script,Pattern.CASE_INSENSITIVE); Matcher m_script=p_script.matcher(htmlStr); htmlStr=m_script.replaceAll(""); //过滤script标签 Pattern p_style=Pattern.compile(regEx_style,Pattern.CASE_INSENSITIVE); Matcher m_style=p_style.matcher(htmlStr); htmlStr=m_style.replaceAll(""); //过滤style标签 Pattern p_html=Pattern.compile(regEx_html,Pattern.CASE_INSENSITIVE); Matcher m_html=p_html.matcher(htmlStr); htmlStr=m_html.replaceAll(""); //过滤html标签 Pattern p_space = Pattern.compile(regEx_space, Pattern.CASE_INSENSITIVE); Matcher m_space = p_space.matcher(htmlStr); htmlStr = m_space.replaceAll(""); // 过滤空格回车标签 return htmlStr.trim(); //返回文本字符串 } public static String stripHtml(String content) { // <p>段落替换为换行 content = content.replaceAll("<p .*?>", "\r\n"); // <br><br/>替换为换行 content = content.replaceAll("<br\\s*/?>", "\r\n"); // 去掉其它的<>之间的东西 content = content.replaceAll("\\<.*?>", ""); // 还原HTML // content = HTMLDecoder.decode(content); return content; } public static String getTextFromHtml(String htmlStr){ htmlStr = delHTMLTag(htmlStr); htmlStr = htmlStr.replaceAll(" ", ""); //htmlStr = htmlStr.substring(0, htmlStr.indexOf("。")+1); return htmlStr; } }
java 去掉html标签(Java中去掉网页HTML标记的方法 )--正则表达式
声明:以上内容来自用户投稿及互联网公开渠道收集整理发布,本网站不拥有所有权,未作人工编辑处理,也不承担相关法律责任,若内容有误或涉及侵权可进行投诉: 投诉/举报 工作人员会在5个工作日内联系你,一经查实,本站将立刻删除涉嫌侵权内容。