Efficient Table Parsing from Wikipedia with Python and BeautifulSoup

To parse tables from Wikipedia more efficiently and robustly, we'll address the issues with pd.read_html() raised in the question. Here's a revised version of the code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def parse_wikipedia_table(url):
    # Fetch the webpage and parse it into a DOM
    res = requests.get(url)
    res.raise_for_status()
    tree = BeautifulSoup(res.text, 'html.parser')

    # Find the first table with class "wikitable"
    wikitable = tree.find('table', class_='wikitable')

    # If no table is found, return None
    if not wikitable:
        return None

    # Extract the text of every cell, keeping the header row
    data = []
    for row in wikitable.find_all('tr'):
        cols = row.find_all(['th', 'td'])
        data.append([col.get_text(strip=True) for col in cols])

    # Pad short rows with None so every row matches the header width
    width = len(data[0])
    for row in data:
        row.extend([None] * (width - len(row)))

    return data

def main():
    url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
    data = parse_wikipedia_table(url)
    if not data:
        print('No wikitable found on the page.')
        return

    # The first row holds the column headers; the rest is the data
    df = pd.DataFrame(data[1:], columns=data[0])

    print(df)

if __name__ == '__main__':
    main()

This revised code fetches the webpage, builds a DOM with BeautifulSoup, finds the first table with class wikitable, extracts the cell text row by row (including the header row), and converts the result to a pandas DataFrame. Because it parses only the one table it needs and guards against a missing table and short rows, it is faster and more robust than the original pd.read_html()-based approach, which parses every table on the page.
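
For comparison, when the page's markup is regular, pandas can extract the same table directly. The following is a minimal sketch, assuming the first wikitable on the page is the one you want; pd.read_html() accepts an attrs filter to select matching tables, and recent pandas versions expect a literal HTML string to be wrapped in StringIO:

from io import StringIO

import pandas as pd
import requests

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
html = requests.get(url).text

# Parse only tables whose class includes "wikitable"; wrapping the
# HTML in StringIO avoids the deprecation warning newer pandas
# versions raise when passing a literal HTML string.
tables = pd.read_html(StringIO(html), attrs={'class': 'wikitable'})
print(tables[0])

This trades cell-level control (text cleanup, padding short rows) for brevity, which is why the explicit BeautifulSoup version above remains preferable when the table markup is irregular.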


Last modified on 2023-11-30