1. Introduction
When working with HTML content in Java, we often need to extract specific text from HTML tags. Although parsing HTML with regular expressions (regex) is generally discouraged because of HTML's complex, nested structure, regex can be sufficient for simple tasks.
In this tutorial, we’ll see how to extract text from HTML tags using regex in Java.
2. Using Pattern and Matcher Classes
Java provides the Pattern and Matcher classes from java.util.regex, allowing us to define and apply regular expressions to extract text from strings. Below is an example of how to extract text from a specified HTML tag using regex:
@Test
void givenHtmlContentWithBoldTags_whenUsingPatternMatcherClasses_thenExtractText() {
    String htmlContent = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
    String tagName = "b";
    String patternString = "<" + tagName + ">(.*?)</" + tagName + ">";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(htmlContent);

    List<String> extractedTexts = new ArrayList<>();
    while (matcher.find()) {
        extractedTexts.add(matcher.group(1));
    }

    assertEquals("Baeldung", extractedTexts.get(0));
    assertEquals("extracting text", extractedTexts.get(1));
}
Here, we first define htmlContent, a string containing HTML with <b> tags. We also set tagName to “b” so that we extract text from <b> tags.
Then, we compile the regex pattern using the compile() method, where patternString is “<b>(.*?)</b>”. The lazy quantifier .*? stops at the first closing tag, so each <b> element is matched separately. Afterward, we use a while loop with the find() method to iterate over all matches and add the captured group of each match to the list named extractedTexts.
Finally, we assert that two texts (“Baeldung” and “extracting text”) are extracted from the <b> tags.
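The pattern above only matches a tag written exactly as <b>, without attributes and in lowercase. As a minimal sketch of how the same idea could be made a little more tolerant, the hypothetical helper below (the class name TagTextExtractor and the method extractTagTexts are our own, not part of any library) quotes the tag name, allows optional attributes, and ignores case. It still assumes simple, non-nested markup and attribute values that don't contain a > character:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {

    // Hypothetical helper: collects the text between each <tagName ...> and </tagName> pair.
    static List<String> extractTagTexts(String html, String tagName) {
        // Pattern.quote() keeps any regex metacharacters in the tag name literal.
        String quotedTag = Pattern.quote(tagName);
        // (?:\s[^>]*)? optionally allows attributes after the tag name.
        Pattern pattern = Pattern.compile(
            "<" + quotedTag + "(?:\\s[^>]*)?>(.*?)</" + quotedTag + "\\s*>",
            Pattern.CASE_INSENSITIVE);

        Matcher matcher = pattern.matcher(html);
        List<String> texts = new ArrayList<>();
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>This is a <b class=\"brand\">Baeldung</b> article for <B>extracting text</B>.</div>";
        System.out.println(extractTagTexts(html, "b")); // prints [Baeldung, extracting text]
    }
}
```

Because the pattern requires either whitespace or > right after the tag name, asking for “b” doesn't accidentally match a <body> tag.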
To handle cases where the tag content may contain newlines, we can prefix the pattern string with the (?s) flag:
String patternString = "(?s)<" + tagName + ">(.*?)</" + tagName + ">";
Here, (?s) enables dotall mode, in which the dot also matches line terminators. With tagName set to “p”, for example, the pattern becomes “(?s)<p>(.*?)</p>” and matches <p> tags whose content spans multiple lines.
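To illustrate the effect of dotall mode, here is a small sketch that runs the same expression with and without the flag on content containing a newline. The class DotallDemo and the helper firstGroup are hypothetical names used only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {

    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>first line\nsecond line</p>";

        // Without dotall mode, the dot can't cross the newline, so there is no match.
        System.out.println(firstGroup(Pattern.compile("<p>(.*?)</p>"), html)); // prints null

        // The inline (?s) flag and the Pattern.DOTALL constant are equivalent.
        String withInlineFlag = firstGroup(Pattern.compile("(?s)<p>(.*?)</p>"), html);
        String withConstant = firstGroup(Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL), html);
        System.out.println(withInlineFlag.equals(withConstant)); // prints true
    }
}
```

Passing Pattern.DOTALL to compile() can be easier to read than an inline flag when the pattern string is built dynamically.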
3. Using JSoup for HTML Parsing and Extraction
For more complex HTML parsing tasks, especially those involving nested tags, using a dedicated library like JSoup is recommended. Let’s demonstrate how to use JSoup to extract text from <p> tags, including handling nested tags:
@Test
void givenHtmlContentWithNestedParagraphTags_thenExtractAllTextsFromHtmlTag() {
    String htmlContent = "<div>This is a <p>multiline\nparagraph <strong>with nested</strong> content</p> and <p>line breaks</p>.</div>";
    Document doc = Jsoup.parse(htmlContent);
    Elements paragraphElements = doc.select("p");

    List<String> extractedTexts = new ArrayList<>();
    for (Element paragraphElement : paragraphElements) {
        String extractedText = paragraphElement.text();
        extractedTexts.add(extractedText);
    }

    assertEquals(2, extractedTexts.size());
    assertEquals("multiline paragraph with nested content", extractedTexts.get(0));
    assertEquals("line breaks", extractedTexts.get(1));
}
Here, we use the parse() method to parse the htmlContent string, converting it into a Document object. Next, we employ the select() method on the doc object to select all <p> elements within the parsed document.
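Subsequently, we iterate over the selected paragraphElements collection, extracting text content from each <p> element using the paragraphElement.text() method. The text() method returns the combined text of the element and its children with whitespace normalized, which is why the first result includes the text of the nested <strong> tag and has a space in place of the newline.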
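To make concrete why a parser is the safer choice for nested markup, the following sketch applies the simple regex from section 2 to nested tags of the same name. The class NestedTagRegexDemo and its extract() method are hypothetical and use only java.util.regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTagRegexDemo {

    // Same pattern shape as in section 2: <tag>(.*?)</tag>
    static List<String> extract(String html, String tagName) {
        Pattern pattern = Pattern.compile("<" + tagName + ">(.*?)</" + tagName + ">");
        Matcher matcher = pattern.matcher(html);
        List<String> texts = new ArrayList<>();
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div>outer <div>inner</div> tail</div>";
        // The lazy match ends at the first closing tag, so the result is cut short
        // and still contains raw markup.
        System.out.println(extract(html, "div")); // prints [outer <div>inner]
    }
}
```

The regex has no notion of which closing tag belongs to which opening tag, so it pairs the outer <div> with the inner </div>. A parser builds the element tree first and therefore doesn't have this problem.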
4. Conclusion
In this article, we explored two approaches to extracting text from HTML tags in Java. First, we used the Pattern and Matcher classes for regex-based extraction from simple markup. Then, we used JSoup for more complex parsing tasks, such as handling nested tags.
As always, the complete source code for the examples is available over on GitHub.