How to extract the content inside HTML tags with a regular expression

2024-12-31 22:55:55

Recommended answers (5)

Answer 1:

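Extract only Rufus and Jenny? That won't work; nothing in the markup distinguishes them from the rest. You presumably want to extract the content of every tag.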

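If the goal is to extract the text inside the tags, write it like this:
        // requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
        // The tags in the sample string were lost from the original post;
        // the <p>/<strong> markup below is an assumed reconstruction.
        String html = "<p><strong>Rufus</strong>Dan, Jenny! Over here! </p>"
                + "<p><strong>Jenny</strong>Hey, dad! </p>"
                + "<p><strong>Rufus</strong>Hey, hey! You made it. Welcome back! "
                + "How was your weekend? How was your mom? </p>";

        // Capture every run of non-'<' characters that sits between '>' and '<'.
        Pattern pattern = Pattern.compile(">([^<]+)<");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find())
        {
            System.out.println(matcher.group(1));
        }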

Output:
Rufus
Dan, Jenny! Over here! 
Jenny
Hey, dad! 
Rufus
Hey, hey! You made it. Welcome back! How was your weekend? How was your mom?
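
When the markup is spread over several lines, the same pattern also captures whitespace-only runs between tags. A minimal sketch that wraps the pattern in a helper and drops those runs follows; the class name `TagTextExtractor`, the method `extractText`, and the sample markup are assumptions for illustration, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagTextExtractor {
    // Same pattern as above: one or more non-'<' characters between '>' and '<'.
    private static final Pattern TEXT_BETWEEN_TAGS = Pattern.compile(">([^<]+)<");

    /** Returns the trimmed text runs found between tags, skipping whitespace-only runs. */
    static List<String> extractText(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TEXT_BETWEEN_TAGS.matcher(html);
        while (matcher.find()) {
            String text = matcher.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the tags in the original question are not preserved.
        String html = "<p><strong>Jenny</strong>\n  Hey, dad! </p>";
        System.out.println(extractText(html)); // prints [Jenny, Hey, dad!]
    }
}
```

This only holds up on simple markup: a `>` inside an attribute value, a comment, or a script block would confuse it.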

Please accept my answer if it helps (*^__^*)

Answer 2:

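function getStr(id, str) {
	var p = document.getElementById(id);
	var text = p.innerHTML;
	var start = text.indexOf(str);
	// indexOf returns -1 when str is absent; without this check substring() would return garbage
	if (start === -1) {
		return null;
	}
	return text.substring(start, start + str.length);
}
alert(getStr('p1', 'Rufus'));
// I gave the first p element an id, p1; the other three are extracted the same way. Just change the id and the string. This is a bare-bones way to pull out the text. When the string you search for is not present, the function returns null; you could raise an error or show a warning there instead.
// I did not use a regular expression; this can be solved without one.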
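
The same no-regex idea can be sketched in Java, the language used by the other answers: locate the opening and closing markers with `indexOf` and return nothing when either is missing. The class `PlainSubstring`, the method `textBetween`, and the sample markup are hypothetical names chosen for this sketch.

```java
import java.util.Optional;

final class PlainSubstring {
    /**
     * Returns the text between the first occurrence of openTag and the next closeTag,
     * or Optional.empty() when either marker is missing.
     */
    static Optional<String> textBetween(String html, String openTag, String closeTag) {
        int start = html.indexOf(openTag);
        if (start < 0) {
            return Optional.empty();
        }
        int contentStart = start + openTag.length();
        int end = html.indexOf(closeTag, contentStart);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(contentStart, end));
    }

    public static void main(String[] args) {
        // Assumed markup; the original tags are not preserved in the question.
        String html = "<p id='p1'><strong>Rufus</strong>Dan, Jenny! Over here!</p>";
        System.out.println(textBetween(html, "<strong>", "</strong>").orElse("(not found)")); // prints Rufus
    }
}
```

Returning an `Optional` makes the "string not found" case explicit instead of leaving it to the caller to notice a `-1` index.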

Answer 3:


Your tags don't look well-formed; the ones around "Rufus" are nested haphazardly.

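 public void strong()
 {
  int i = 0; // number of matches found
  // The pattern was lost from the original post; <strong>...</strong> is
  // assumed here from the method name and the label printed below.
  final String regex = "<strong>.*?</strong>";
  final Pattern pt = Pattern.compile(regex);
  // ContentArea is a field of the answerer's Urls class (not shown).
  final Matcher mt = pt.matcher(ContentArea);
  while (mt.find()) {
   System.out.println(mt.group());
   i++;

   // Get the title: the text between '>' and '<' (the '<' is assumed, see above)
   final Matcher title = Pattern.compile(">.*?<").matcher(mt.group());
   while (title.find()) {
    System.out.println("strong is: "
      + title.group().replaceAll(">|<", ""));
   }
   System.out.println();
  }
 }

 public static void main(String[] args)
 {
  Urls myurl = new Urls("");
  myurl.getStartUrl("..."); // the URL
  myurl.getUrlContent();
  myurl.getContentArea();
  myurl.strong();
 }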
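
The two-step "match the element, then strip the brackets" approach can be collapsed into a single capture group. The sketch below assumes the goal is the text of each `strong` element and works on a string literal rather than a fetched page; the class `StrongText`, the method `strongContents`, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrongText {
    // Lazy (.*?) stops at the first closing tag; DOTALL lets the content span lines.
    // [^>]* tolerates attributes on the opening tag.
    private static final Pattern STRONG = Pattern.compile(
            "<strong[^>]*>(.*?)</strong>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the inner text of every strong element, in document order. */
    static List<String> strongContents(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = STRONG.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup for illustration.
        String html = "<p><strong>Rufus</strong>Hey, hey!</p>"
                + "<p><STRONG class='x'>Jenny</STRONG>Hey, dad!</p>";
        System.out.println(strongContents(html)); // prints [Rufus, Jenny]
    }
}
```

As with any regex over HTML, badly nested tags can still produce surprising matches.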

Answer 4:

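// The HTML tags in $str and part of $pattern were lost from the original post.
// The <li>/<a>/<i> markup and the 'xxx' href values are an assumed reconstruction.
$str="<li><a href='xxx'>yyy</a><i>zzz</i></li>
<li><a href='xxx1'>yyy1</a><i>zzz1</i></li>";
// Captures: 1 = href value, 2 = link text, 3 = text inside <i>
$pattern='/<li><a[^>]+href=\'([^\']*)\'[^>]*>([^<]*)<\/a>.*<i>([^<]*)<\/i><\/li>/iUs';
preg_match_all($pattern, $str, $matches);
print_r($matches);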
See whether this works for you; you presumably know how to unpack the resulting array.
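
For comparison, a rough Java counterpart is sketched below. It assumes list items of the form `<li><a href='…'>…</a><i>…</i></li>`, which is a guess based on the surviving part of the pattern; the class `LinkItems`, the method `parse`, and the sample hrefs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkItems {
    // Groups: 1 = href value, 2 = link text, 3 = text inside <i>.
    // .*? is lazy so one match does not run past the first <i> that follows the link.
    private static final Pattern ITEM = Pattern.compile(
            "<li>\\s*<a[^>]+href='([^']*)'[^>]*>([^<]*)</a>.*?<i>([^<]*)</i>\\s*</li>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one {href, linkText, italicText} array per matching list item. */
    static List<String[]> parse(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the hrefs are placeholders.
        String html = "<li><a href='/a'>yyy</a><i>zzz</i></li>\n"
                + "<li><a href='/b'>yyy1</a><i>zzz1</i></li>";
        for (String[] row : parse(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Java has no equivalent of PCRE's `U` modifier, so laziness is written explicitly with `.*?`.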

Answer 5:

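Approach: parse the HTML file first; a third-party package such as digester can do this.
Using a regular expression directly on the HTML is not recommended.
Regular expressions are used more for validating formats, such as an email address.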
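
To illustrate the kind of format check meant here, the sketch below tests whether a whole string has the general shape of an email address. The pattern is deliberately simple and is an assumption for illustration, not a complete RFC 5322 validator; the class `EmailFormat` and the method `looksLikeEmail` are hypothetical names.

```java
import java.util.regex.Pattern;

final class EmailFormat {
    // Local part, '@', domain, then a dot and a top-level label of at least two letters.
    // Deliberately simple; real-world addresses can be more varied than this allows.
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    /** Returns true when the whole string matches the simple email shape. */
    static boolean looksLikeEmail(String s) {
        return s != null && SIMPLE_EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("rufus@example.com")); // prints true
        System.out.println(looksLikeEmail("rufus@example"));     // prints false
    }
}
```

A check like this judges one flat string against one fixed shape, which is why it suits regular expressions better than nested markup does.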