From 5bf0d1126ff560c502c1bb095616e7b9acd39c93 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 16 Jan 2026 08:43:05 +0000
Subject: [PATCH 1/7] Initial plan

From 0fcfd8b3232c7c88702b41338edeb93ed519a781 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Fri, 16 Jan 2026 08:49:20 +0000
Subject: [PATCH 2/7] Replace HTMLParser with BeautifulSoup for extracting only
 relevant Wikipedia content

Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
---
 .../01-defining-data-science/notebook.ipynb | 46 ++++++++-----------
 .../solution/notebook.ipynb                 | 44 ++++++++----------
 2 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/notebook.ipynb b/1-Introduction/01-defining-data-science/notebook.ipynb
index cf3988e85..4648caf0b 100644
--- a/1-Introduction/01-defining-data-science/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/notebook.ipynb
@@ -66,38 +66,32 @@
  {
   "cell_type": "markdown",
   "source": [
-    "## Step 2: Transforming the Data\r\n",
-    "\r\n",
-    "The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n",
-    "\r\n",
-    "There are many ways this can be done. We will use the simplest built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `
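
The removed markdown cell above describes subclassing Python's built-in `HTMLParser` to collect the text inside all HTML tags except `<script>` and `<style>` (the diff is truncated before the cell's actual code). A minimal sketch of that approach — class and attribute names here are illustrative, not taken from the notebook — could look like:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip = False   # True while inside a <script>/<style> element
        self.text = []      # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        # Keep only non-empty data that is not inside a skipped element
        if not self.skip and data.strip():
            self.text.append(data.strip())


parser = TextExtractor()
parser.feed("<p>Hello</p><script>var x = 1;</script><p>world</p>")
print(" ".join(parser.text))  # → Hello world
```

This is the weakness the patch addresses: `HTMLParser` sees the whole page, so navigation menus and other chrome end up in the extracted text along with the article body.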
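
The commit subject says the patch swaps `HTMLParser` for BeautifulSoup so that only the relevant Wikipedia content is extracted, but the added lines are truncated in this diff. A minimal sketch of that kind of extraction — the function name and the `mw-content-text` selector are assumptions about how Wikipedia pages are laid out, not lines from the patch — might look like:

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return visible article text, dropping <script>/<style> contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Wikipedia keeps the article body in <div id="mw-content-text">;
    # fall back to the whole document if that div is absent (assumption).
    content = soup.find(id="mw-content-text") or soup
    for tag in content(["script", "style"]):
        tag.decompose()  # remove non-content tags in place
    # Normalize whitespace in the extracted text
    return " ".join(content.get_text(separator=" ").split())


html = ('<html><body><div id="mw-content-text">'
        '<p>Data science is an interdisciplinary field.</p>'
        '<script>var x = 1;</script></div></body></html>')
print(extract_article_text(html))  # → Data science is an interdisciplinary field.
```

Scoping the extraction to the article container is what lets BeautifulSoup return "only relevant Wikipedia content", as the commit message puts it, rather than the page's menus, sidebars, and footers.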