Channel: Baeldung

Parsing HTML Table in Java With Jsoup

1. Overview

Jsoup is an open-source library used to scrape HTML pages. It provides an API for data parsing, extraction, and manipulation using DOM API methods.

In this article, we will see how to parse an HTML table using Jsoup. We will retrieve and update data in the table, and also add and delete rows.

2. Dependencies

To use the Jsoup library, add the following dependency to the project:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>

We can find the latest version of the Jsoup library in the Maven central repository.

3. Table Structure

To illustrate parsing HTML tables with Jsoup, we will use a sample HTML structure. The complete HTML is available in the code base in the GitHub repository mentioned at the end of the article. Here, for brevity, we show the table with only its header row and a single data row:

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Maths</th>
            <th>English</th>
            <th>Science</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Student 1</td>
            <td>90</td>
            <td>85</td>
            <td>92</td>
        </tr>
    </tbody>
</table>

As we can see, the table has a header row inside the thead tag, followed by data rows inside the tbody tag. We assume that the table in the HTML document follows this format.

4. Parsing Table

Firstly, to select an HTML table from the parsed document, we can use the code snippet below:

Element table = doc.select("table").first();
Elements rows = table.select("tr");
Elements first = rows.get(0).select("th,td");

Here, we first select the table element from the document and then select its tr elements to get the rows. Because the table contains several rows, we take the first one and select its th or td elements to get the header cells. Building on these calls, we can write a function that parses the table data.

Here, we are assuming no colspan or rowspan elements are used in the table, and the first row is present with header th tags.
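As a minimal, self-contained sketch of the selection steps above, the following class parses a small inline HTML string instead of a file and returns the text of the cells in the first row. The class name, the method name, and the inline HTML are assumptions made for illustration and are not part of the article's code base:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class TableSelectionSketch {

    // Hypothetical inline sample; a real page would usually be loaded from a file or URL
    static final String HTML = "<table>"
      + "<thead><tr><th>Name</th><th>Maths</th></tr></thead>"
      + "<tbody><tr><td>Student 1</td><td>90</td></tr></tbody>"
      + "</table>";

    // Returns the text of each th/td cell in the first row of the first table,
    // or an empty list when the document has no table or the table has no rows
    static List<String> headerTexts(String html) {
        Document doc = Jsoup.parse(html);
        Element table = doc.selectFirst("table");
        if (table == null) {
            return List.of();
        }
        Elements rows = table.select("tr");
        if (rows.isEmpty()) {
            return List.of();
        }
        return rows.get(0)
          .select("th,td")
          .stream()
          .map(Element::text)
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(headerTexts(HTML));
    }
}
```

The null and empty checks are a design choice of this sketch; the article's own code assumes the table and its header row are always present.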

Following is the code for parsing the table:

public List<Map<String, String>> parseTable(Document doc, int tableOrder) {
    // Pick the table by its zero-based position in the document
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    // The first row of the table is assumed to hold the column headers
    Elements headerRow = table.select("tr")
      .get(0)
      .select("th,td");
    List<String> headers = new ArrayList<>();
    for (Element header : headerRow) {
        headers.add(header.text());
    }
    List<Map<String, String>> parsedDataRows = new ArrayList<>();
    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");
        int colCount = 0;
        // Map each cell's text to the header of the same column
        Map<String, String> dataRow = new HashMap<>();
        for (Element colVal : colVals) {
            dataRow.put(headers.get(colCount++), colVal.text());
        }
        parsedDataRows.add(dataRow);
    }
    return parsedDataRows;
}

In this function, the doc parameter is the HTML document loaded from the file, and tableOrder is the zero-based index of the table element in the document. We use a List<Map<String, String>> to store the data rows found under the tbody element. Each element of the list is a Map representing one data row, with the column name as the key and the cell text for that column as the value. Using a list of Maps makes it easy to access the retrieved data.

The list index represents row numbers, and we can get specific cell data by its map key.
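To make this access pattern concrete, here is a small sketch that works on a hand-built list shaped like the result of parseTable, so it runs without an HTML file. The class name, the sampleRows and totalMarks helpers, and the assumption that every column except Name holds an integer are all illustrative. The sketch uses a LinkedHashMap so that columns keep their insertion order, which a plain HashMap does not guarantee:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParsedTableUsageSketch {

    // Hand-built stand-in for the List<Map<String, String>> that parseTable returns
    static List<Map<String, String>> sampleRows() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Name", "Student 1");
        row.put("Maths", "90");
        row.put("English", "85");
        row.put("Science", "92");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);
        return rows;
    }

    // Adds up the marks in one row, assuming every column except "Name" holds an integer
    static int totalMarks(Map<String, String> row) {
        int total = 0;
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (!"Name".equals(cell.getKey())) {
                total += Integer.parseInt(cell.getValue().trim());
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = sampleRows();
        // The list index selects the row, and the map key selects the column
        System.out.println(rows.get(0).get("Maths"));
        System.out.println(totalMarks(rows.get(0)));
    }
}
```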

We can verify if table data is retrieved correctly using the test case below:

@Test
public void whenDocumentTableParsed_thenTableDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("90", tableData.get(0).get("Maths")); 
}

The JUnit test case confirms that the text of each table cell has been parsed and stored in an ArrayList of HashMap objects. Each element of the list represents one data row, and each HashMap maps a column header to the cell text in that row. Using this structure, we can easily access table data.

5. Update Elements of the Parsed Table

To update a cell while parsing, we can call one of the following methods on the td element retrieved from the row. The first sets the cell's text:

colVals.get(colCount++).text(updateValue);

or, to set the cell's inner HTML instead of plain text:

colVals.get(colCount++).html(updateValue);

The function to update values in the parsed table would look like below:

public void updateTableData(Document doc, int tableOrder, String updateValue) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");
        for (int colCount = 0; colCount < colVals.size(); colCount++) {
            colVals.get(colCount).text(updateValue);
        }
    }
}

In the above function, we get the data rows from the tbody element of the table. The function traverses each cell of those rows and sets its value to the updateValue parameter. It sets all cells to the same value simply to demonstrate that cell values can be updated using Jsoup. We can also update an individual cell by specifying its row and column index within the data rows.
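As a sketch of updating one cell by row and column index, the method below follows the same selection steps but touches only a single td. The class name, the updateCell method, and its boolean return value are assumptions for illustration and do not come from the article's code base:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SingleCellUpdateSketch {

    // Sets the text of one body cell; returns false when any index is out of range
    static boolean updateCell(Document doc, int tableOrder, int rowIndex, int colIndex, String value) {
        Elements tables = doc.select("table");
        if (tableOrder < 0 || tableOrder >= tables.size()) {
            return false;
        }
        Elements dataRows = tables.get(tableOrder).select("tbody tr");
        if (rowIndex < 0 || rowIndex >= dataRows.size()) {
            return false;
        }
        Elements cells = dataRows.get(rowIndex).select("th,td");
        if (colIndex < 0 || colIndex >= cells.size()) {
            return false;
        }
        cells.get(colIndex).text(value);
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical inline sample used instead of Students.html
        Document doc = Jsoup.parse("<table>"
          + "<thead><tr><th>Name</th><th>Maths</th></tr></thead>"
          + "<tbody><tr><td>Student 1</td><td>90</td></tr></tbody>"
          + "</table>");
        updateCell(doc, 0, 0, 1, "95");
        System.out.println(doc.select("tbody tr").get(0).text());
    }
}
```

Returning a boolean rather than throwing on a bad index is just one possible design; it keeps the caller's code free of exception handling for this simple case.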

The test below verifies the update function:

@Test
public void whenTableUpdated_thenUpdatedDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    jsoParser.updateTableData(doc, 0, "50");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("50", tableData.get(2).get("Maths"));
}

The JUnit test case confirms that the update operation updates all table cell values to 50. Here we are verifying data from the third data row of the Maths column.

Similarly, we can set desired values for specific cells of the table.

6. Adding Row to the Table

We can add a row to the table using the following function:

public void addRowToTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements rows = table.select("tr");
    Elements headerCols = rows.get(0).select("th,td");
    int numCols = headerCols.size();
    Elements colVals = new Elements(numCols);
    for (int colCount = 0; colCount < numCols; colCount++) {
        Element colVal = new Element("td");
        colVal.text("11");
        colVals.add(colVal);
    }
    Elements dataRows = tbody.select("tr");
    Element newDataRow = new Element("tr");
    newDataRow.appendChildren(colVals);
    dataRows.add(newDataRow);
    tbody.html(dataRows.toString());
}

In the above function, we are getting the number of columns from the header row and the data rows from the tbody element of the table. After adding a new row to the dataRows list, we are updating the tbody HTML content with the dataRows.

We can verify row addition using the following test case:

@Test
public void whenTableRowAdded_thenRowCountIncreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeAdd = tableData.size();
    jsoParser.addRowToTable(doc, 0);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeAdd + 1, tableData.size());
}

We can confirm from the JUnit test case that the addRowToTable operation on the table increases the number of rows in the table by 1. This operation adds a new row at the end of the list.

Similarly, we can add a row at any position by specifying the index while adding it to the row elements collection.
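An alternative sketch for inserting at a given position works on the DOM directly instead of rebuilding the tbody HTML: it places the new tr before the existing row at that position, or appends it when the position is past the last row. The class name, the insertRow method, and its parameters are hypothetical, and the sketch assumes the table has a tbody element, as the sample table does:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class RowInsertSketch {

    // Inserts a new data row so that it ends up at the given zero-based position in tbody;
    // positions beyond the last row append the row at the end
    static void insertRow(Document doc, int tableOrder, int position, List<String> values) {
        Element table = doc.select("table").get(tableOrder);
        Element tbody = table.select("tbody").get(0);

        Element newRow = new Element("tr");
        for (String value : values) {
            Element cell = new Element("td");
            cell.text(value);
            newRow.appendChild(cell);
        }

        // children() lists element children only, so whitespace text nodes are ignored
        Elements existingRows = tbody.children();
        if (position >= 0 && position < existingRows.size()) {
            existingRows.get(position).before(newRow);
        } else {
            tbody.appendChild(newRow);
        }
    }

    public static void main(String[] args) {
        // Hypothetical inline sample used instead of Students.html
        Document doc = Jsoup.parse("<table><tbody>"
          + "<tr><td>Student 1</td><td>90</td></tr>"
          + "<tr><td>Student 2</td><td>80</td></tr>"
          + "</tbody></table>");
        insertRow(doc, 0, 1, List.of("Student X", "70"));
        System.out.println(doc.select("tbody").first().html());
    }
}
```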

7. Delete the Row From the Table

We can delete a row from the table using the following function:

public void deleteRowFromTable(Document doc, int tableOrder, int rowNumber) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    if (rowNumber >= 0 && rowNumber < dataRows.size()) {
        dataRows.remove(rowNumber);
    }
}

In the above function, we get the tbody element of the table and, from it, the list of dataRows. We then delete the row at the rowNumber position from that list.

We can verify row deletion using the following test case:

@Test
public void whenTableRowDeleted_thenRowCountDecreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeDel = tableData.size();
    jsoParser.deleteRowFromTable(doc, 0, 2);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeDel - 1, tableData.size());
}

The JUnit test case confirms that the deleteRowFromTable operation on the table reduces the number of rows in the table by 1.

Similarly, we can delete a row at any position by specifying the index while removing it from the row elements collection.
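A slightly different sketch of the same idea removes the row node itself by calling remove() on the tr element. This detaches the row from the document without relying on how the Elements list handles removal in a particular Jsoup version. The class name, the deleteRow method, and its boolean return value are assumptions made for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RowDeleteSketch {

    // Detaches the data row at rowNumber from the DOM; returns true if a row was removed
    static boolean deleteRow(Document doc, int tableOrder, int rowNumber) {
        Elements dataRows = doc.select("table")
          .get(tableOrder)
          .select("tbody tr");
        if (rowNumber < 0 || rowNumber >= dataRows.size()) {
            return false;
        }
        dataRows.get(rowNumber).remove();
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical inline sample used instead of Students.html
        Document doc = Jsoup.parse("<table><tbody>"
          + "<tr><td>Student 1</td><td>90</td></tr>"
          + "<tr><td>Student 2</td><td>80</td></tr>"
          + "</tbody></table>");
        deleteRow(doc, 0, 0);
        System.out.println(doc.select("tbody tr").size());
    }
}
```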

8. Conclusion

In this article, we have seen how to use Jsoup to parse HTML tables from HTML documents. We have also seen how to update cell data and how to change the table structure by adding and deleting rows. As always, the source for these examples is available over on GitHub.
