From 5bf0d1126ff560c502c1bb095616e7b9acd39c93 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 16 Jan 2026 08:43:05 +0000
Subject: [PATCH 1/7] Initial plan

From 0fcfd8b3232c7c88702b41338edeb93ed519a781 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 16 Jan 2026 08:49:20 +0000
Subject: [PATCH 2/7] Replace HTMLParser with BeautifulSoup for extracting only
 relevant Wikipedia content

Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
---
 .../01-defining-data-science/notebook.ipynb | 46 ++++++++-----------
 .../solution/notebook.ipynb                 | 44 ++++++++----------
 2 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/notebook.ipynb b/1-Introduction/01-defining-data-science/notebook.ipynb
index cf3988e85..4648caf0b 100644
--- a/1-Introduction/01-defining-data-science/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/notebook.ipynb
@@ -66,38 +66,32 @@
  {
   "cell_type": "markdown",
   "source": [
-    "## Step 2: Transforming the Data\r\n",
-    "\r\n",
-    "The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n",
-    "\r\n",
-    "There are many ways this can be done. We will use the simplest built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `
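
The removed markdown cell above describes subclassing Python's built-in `HTMLParser` to collect the text inside all HTML tags except `<script>` and `<style>` (the diff is truncated before the cell's actual code). A minimal sketch of that approach — class and attribute names here are illustrative, not taken from the notebook — could look like:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip = False   # True while inside a <script>/<style> element
        self.text = []      # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        # Keep only non-empty data that is not inside a skipped element
        if not self.skip and data.strip():
            self.text.append(data.strip())


parser = TextExtractor()
parser.feed("<p>Hello</p><script>var x = 1;</script><p>world</p>")
print(" ".join(parser.text))  # → Hello world
```

This is the weakness the patch addresses: `HTMLParser` sees the whole page, so navigation menus and other chrome end up in the extracted text along with the article body.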
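
The commit subject says the patch swaps `HTMLParser` for BeautifulSoup so that only the relevant Wikipedia content is extracted, but the added lines are truncated in this diff. A minimal sketch of that kind of extraction — the function name and the `mw-content-text` selector are assumptions about how Wikipedia pages are laid out, not lines from the patch — might look like:

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return visible article text, dropping <script>/<style> contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Wikipedia keeps the article body in <div id="mw-content-text">;
    # fall back to the whole document if that div is absent (assumption).
    content = soup.find(id="mw-content-text") or soup
    for tag in content(["script", "style"]):
        tag.decompose()  # remove non-content tags in place
    # Normalize whitespace in the extracted text
    return " ".join(content.get_text(separator=" ").split())


html = ('<html><body><div id="mw-content-text">'
        '<p>Data science is an interdisciplinary field.</p>'
        '<script>var x = 1;</script></div></body></html>')
print(extract_article_text(html))  # → Data science is an interdisciplinary field.
```

Scoping the extraction to the article container is what lets BeautifulSoup return "only relevant Wikipedia content", as the commit message puts it, rather than the page's menus, sidebars, and footers.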