Extract Text From an HTML Tag with Regex

1. Introduction

When working with HTML content in Java, extracting specific text from HTML tags is a common task. While using regular expressions (regex) to parse HTML is generally discouraged because of HTML's nested and irregular structure, regex can be sufficient for simple, well-defined cases.

In this tutorial, we’ll see how to extract text from HTML tags using regex in Java.

2. Using Pattern and Matcher Classes

Java provides the Pattern and Matcher classes from java.util.regex, allowing us to define and apply regular expressions to extract text from strings. Below is an example of how to extract text from a specified HTML tag using regex:

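import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.jupiter.api.Test;

// The JUnit imports above assume JUnit 5 (Jupiter)
@Test
void givenHtmlContentWithBoldTags_whenUsingPatternMatcherClasses_thenExtractText() {
    String htmlContent = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
    String tagName = "b";
    // Lazy group (.*?) stops at the first closing tag instead of the last one
    String patternString = "<" + tagName + ">(.*?)</" + tagName + ">";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(htmlContent);
    List<String> extractedTexts = new ArrayList<>();
    // Each find() call advances to the next match; group(1) is the text between the tags
    while (matcher.find()) {
        extractedTexts.add(matcher.group(1));
    }
    assertEquals("Baeldung", extractedTexts.get(0));
    assertEquals("extracting text", extractedTexts.get(1));
}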

Here, we first define the HTML content, denoted as htmlContent, which contains HTML with <b> tags. Moreover, we specify the tag name tagName as “b” to extract text from <b> tags.

Then, we compile the regex pattern using the compile() method, where patternString is “<b>(.*?)</b>” to match and extract text within <b> tags. The lazy quantifier *? makes each match stop at the first closing tag rather than the last one. Afterward, we use a while loop with the find() method to iterate over all matches and add the captured text, group(1), of each match to the list named extractedTexts.

Finally, we assert that two texts (“Baeldung” and “extracting text”) are extracted from the <b> tags.
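The steps above can also be wrapped in a small reusable method. The following is a minimal sketch, not part of the original example: the class name TagTextExtractor and the method extractTagTexts() are hypothetical, and the use of Pattern.quote() is an added precaution so that a tag name containing regex metacharacters is treated literally. It assumes the tags carry no attributes and are not nested inside themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {

    // Hypothetical helper: returns the text inside every <tagName>...</tagName> pair.
    // Assumes tags have no attributes and are not nested inside themselves.
    static List<String> extractTagTexts(String html, String tagName) {
        // Pattern.quote() keeps any regex metacharacters in tagName literal
        String quotedTag = Pattern.quote(tagName);
        Pattern pattern = Pattern.compile("<" + quotedTag + ">(.*?)</" + quotedTag + ">");
        Matcher matcher = pattern.matcher(html);
        List<String> texts = new ArrayList<>();
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
        System.out.println(extractTagTexts(html, "b")); // prints [Baeldung, extracting text]
    }
}
```

When the tag doesn't occur in the input, the method returns an empty list, so callers can check the size before reading elements by index.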

To handle cases where tag contents may contain newlines, we can modify the pattern string by adding (?s) as follows:

String patternString = "(?s)<" + tagName + ">(.*?)</" + tagName + ">";

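Here, we use the regex pattern “(?s)<b>(.*?)</b>”, where the (?s) flag enables dotall mode. In this mode, the dot also matches line terminators, so the pattern matches <b> tags whose contents span multiple lines.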
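To see the effect of dotall mode, we can run both patterns against the same multi-line input. This is a small sketch with a hypothetical class DotallDemo and helper firstMatch(); the sample string is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {

    // Hypothetical helper: returns the first captured group, or null when nothing matches
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<b>first line\nsecond line</b>";

        // Without (?s), the dot does not match the newline, so there is no match
        System.out.println(firstMatch("<b>(.*?)</b>", html)); // prints null

        // With (?s), the dot matches the newline and the whole content is captured
        System.out.println(firstMatch("(?s)<b>(.*?)</b>", html));
    }
}
```

Passing the Pattern.DOTALL flag as the second argument of Pattern.compile() has the same effect as the embedded (?s) flag.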

3. Using JSoup for HTML Parsing and Extraction

For more complex HTML parsing tasks, especially those involving nested tags, using a dedicated library like JSoup is recommended. Let’s demonstrate how to use JSoup to extract text from <p> tags, including handling nested tags:

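import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.jupiter.api.Test;

// The JUnit imports above assume JUnit 5 (Jupiter)
@Test
void givenHtmlContentWithNestedParagraphTags_thenExtractAllTextsFromHtmlTag() {
    String htmlContent = "<div>This is a <p>multiline\nparagraph <strong>with nested</strong> content</p> and <p>line breaks</p>.</div>";
    Document doc = Jsoup.parse(htmlContent);
    Elements paragraphElements = doc.select("p");
    List<String> extractedTexts = new ArrayList<>();
    for (Element paragraphElement : paragraphElements) {
        // text() includes the text of nested elements and normalizes whitespace
        String extractedText = paragraphElement.text();
        extractedTexts.add(extractedText);
    }
    assertEquals(2, extractedTexts.size());
    assertEquals("multiline paragraph with nested content", extractedTexts.get(0));
    assertEquals("line breaks", extractedTexts.get(1));
}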

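Here, we use the parse() method to parse the htmlContent string, converting it into a Document object. Next, we call the select() method on the doc object to select all <p> elements within the parsed document.

Subsequently, we iterate over the selected paragraphElements collection, extracting text content from each <p> element using the paragraphElement.text() method. The text() method returns the combined text of the element and its children with whitespace normalized, which is why the nested <strong> markup disappears and the newline in the first paragraph becomes a single space.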
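For contrast, here is a sketch of what the regex approach from section 2 returns for nested markup. The class NestedTagDemo and the helper firstGroup() are hypothetical and only illustrate the limitation; they use nothing beyond java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTagDemo {

    // Hypothetical helper: first captured group of (?s)<tag>(.*?)</tag>, or null if no match
    static String firstGroup(String html, String tagName) {
        String regex = "(?s)<" + tagName + ">(.*?)</" + tagName + ">";
        Matcher matcher = Pattern.compile(regex).matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // A different tag nested inside: the regex returns the raw inner markup
        System.out.println(firstGroup("<p>a <strong>b</strong> c</p>", "p"));
        // prints a <strong>b</strong> c

        // The same tag nested inside itself: the lazy match ends at the first closing tag
        System.out.println(firstGroup("<div>outer <div>inner</div> tail</div>", "div"));
        // prints outer <div>inner
    }
}
```

In the first case, the result still contains the <strong> markup, and in the second, the match is cut off at the inner closing tag. A parser such as JSoup avoids both problems because it works on the element tree rather than on the raw string.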

4. Conclusion

In conclusion, we have explored different approaches to extracting text from HTML tags in Java. Firstly, we discussed using the Pattern and Matcher classes for regex-based text extraction. Additionally, we examined leveraging JSoup for more complex HTML parsing tasks.

As always, the complete source code for the examples is available over on GitHub.
