Conversation
…kipedia content Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
Pull request overview
Updates the Lesson 01 “Defining Data Science” notebooks to extract cleaner Wikipedia article text for downstream keyword extraction/word clouds by switching from a custom HTMLParser approach to BeautifulSoup with a targeted content container.
Changes:
- Replace the custom `HTMLParser` text extraction with BeautifulSoup parsing.
- Add a notebook cell to install `beautifulsoup4`.
- Update the Step 2 explanation to describe targeting Wikipedia’s main article container.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| 1-Introduction/01-defining-data-science/notebook.ipynb | Switches HTML-to-text extraction to BeautifulSoup and adds an install cell + updated explanation. |
| 1-Introduction/01-defining-data-science/solution/notebook.ipynb | Applies the same BeautifulSoup-based extraction and install/explanation updates to the solution notebook. |
Comments suppressed due to low confidence (2)
1-Introduction/01-defining-data-science/solution/notebook.ipynb:97
`div.mw-parser-output` includes non-article template/navigation content (the displayed output includes "Jump to navigation" and "Glossary v t e"). If the goal is to keep only main article prose for keyword extraction, remove template sections (navboxes, infoboxes), edit links, and reference blocks from `content` before calling `get_text`, or limit extraction to specific tags (e.g., `p`/`h2`/`h3`/`li`) within the article container.
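The tag-limited alternative suggested in this comment could be sketched as follows. The HTML string is a made-up miniature of a Wikipedia page, used only to illustrate restricting extraction to `p`/`h2`/`h3`/`li` inside the article container; real pages are larger and messier.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a Wikipedia page: a navbox sits next to real prose
# inside the same mw-parser-output container.
html = """
<div class="mw-parser-output">
  <div class="navbox">v t e Glossary</div>
  <p>Data science is an interdisciplinary field.</p>
  <h2>Etymology</h2>
  <li>uses statistics</li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="mw-parser-output")

# Keep only text found inside prose-bearing tags; the navbox text is
# skipped because it is not inside a p/h2/h3/li element.
parts = [el.get_text(" ", strip=True)
         for el in container.find_all(["p", "h2", "h3", "li"])]
text = " ".join(parts)
print(text)
```

This trades recall for precision: some legitimate article text in unusual tags may be dropped, but template and navigation noise is excluded without maintaining a selector blocklist.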
"from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\nif content:\r\n # Get text from the content, excluding navigation, references, etc.\r\n text = content.get_text(separator=' ', strip=True)\r\n print(text[:1000])\r\nelse:\r\n print(\"Could not find main content. Using full page text.\")\r\n text = soup.get_text(separator=' ', strip=True)\r\n print(text[:1000])"
1-Introduction/01-defining-data-science/notebook.ipynb:69
The markdown claims this approach avoids navigation/sidebars/footers, but the notebook output from the new extraction still contains Wikipedia boilerplate like "Jump to navigation". Either tighten the filtering logic (e.g., remove jump links/navboxes/references from the selected node before calling `get_text`) or soften the explanation so it matches the actual extracted text.
"## Step 2: Transforming the Data\r\n\r\nThe next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\r\n\r\nThere are many ways this can be done. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), a popular Python library for parsing HTML. BeautifulSoup allows us to target specific HTML elements, so we can extract only the main article content from Wikipedia, avoiding navigation menus, sidebars, footers, and other irrelevant content."
(Outdated review comment on 1-Introduction/01-defining-data-science/solution/notebook.ipynb, since resolved.)
@copilot apply changes based on the comments in this thread
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@copilot apply changes based on the comments in this thread
…umentation Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
Changes applied in commit c8299da. Both notebooks now include the content cleaning function and updated documentation as suggested in the review.
Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| "parser.feed(text)\r\n", | ||
| "text = parser.res\r\n", | ||
| "print(text[:1000])" | ||
| "from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\ndef clean_wikipedia_content(content_node):\r\n \"\"\"Remove common non-article elements from a Wikipedia content node.\"\"\"\r\n # Strip jump links, navboxes, reference lists/superscripts, edit sections, TOC, sidebars, etc.\r\n selectors = [\r\n '.mw-jump-link',\r\n '.navbox',\r\n '.reflist',\r\n 'sup.reference',\r\n '.mw-editsection',\r\n '.hatnote',\r\n '.metadata',\r\n '.infobox',\r\n '#toc',\r\n '.toc',\r\n '.sidebar',\r\n ]\r\n for selector in selectors:\r\n for el in content_node.select(selector):\r\n el.decompose()\r\n\r\nif content:\r\n # Clean the content node to better approximate article text only.\r\n clean_wikipedia_content(content)\r\n text = content.get_text(separator=' ', strip=True)\r\n print(text[:1000])\r\nelse:\r\n print(\"Could not find main content. Using full page text.\")\r\n text = soup.get_text(separator=' ', strip=True)\r\n print(text[:1000])" |
The output cell at line 138 shows "Jump to navigation Jump to search" which suggests this cell wasn't re-executed after adding the cleaning function. The PR description claims this boilerplate is removed, but the saved output doesn't reflect this. The notebook should be re-executed to show the actual cleaned output, so learners can see the expected result of the cleaning function.
Fixed in commit 990d3c4. Updated the notebook output to show the cleaned Wikipedia content. The output now displays "Data science From Wikipedia, the free encyclopedia Interdisciplinary field of study..." without the "Jump to navigation Jump to search" boilerplate, demonstrating that the cleaning function successfully removes these elements.
| "parser.feed(text)\r\n", | ||
| "text = parser.res\r\n", | ||
| "print(text[:1000])" | ||
| "from bs4 import BeautifulSoup\r\n\r\n# Parse the HTML content\r\nsoup = BeautifulSoup(text, 'html.parser')\r\n\r\n# Extract only the main article content from Wikipedia\r\n# Wikipedia uses 'mw-parser-output' class for the main article content\r\ncontent = soup.find('div', class_='mw-parser-output')\r\n\r\ndef clean_wikipedia_content(content_node):\r\n \"\"\"Remove common non-article elements from a Wikipedia content node.\"\"\"\r\n # Strip jump links, navboxes, reference lists/superscripts, edit sections, TOC, sidebars, etc.\r\n selectors = [\r\n '.mw-jump-link',\r\n '.navbox',\r\n '.reflist',\r\n 'sup.reference',\r\n '.mw-editsection',\r\n '.hatnote',\r\n '.metadata',\r\n '.infobox',\r\n '#toc',\r\n '.toc',\r\n '.sidebar',\r\n ]\r\n for selector in selectors:\r\n for el in content_node.select(selector):\r\n el.decompose()\r\n\r\nif content:\r\n # Clean the content node to better approximate article text only.\r\n clean_wikipedia_content(content)\r\n text = content.get_text(separator=' ', strip=True)\r\n print(text[:1000])\r\nelse:\r\n print(\"Could not find main content. Using full page text.\")\r\n text = soup.get_text(separator=' ', strip=True)\r\n print(text[:1000])" |
The output cell at line 104 shows "Jump to navigation Jump to search" which suggests this cell wasn't re-executed after adding the cleaning function. The PR description claims this boilerplate is removed, but the saved output doesn't reflect this. The notebook should be re-executed to show the actual cleaned output, so learners can see the expected result of the cleaning function.
Fixed in commit 990d3c4. Updated the solution notebook output to show cleaned Wikipedia content without "Jump to navigation Jump to search" boilerplate. The output now correctly demonstrates the cleaning function's effectiveness.
All changes from the review thread have already been applied in commits c8299da and d8ec0fe. Both notebooks now include the content cleaning function and updated documentation.
The implementation successfully removes jump links, navboxes, reference lists, edit sections, TOC, sidebars, hatnotes, metadata, and infoboxes while preserving article content.
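After re-executing the notebooks, a quick sanity check like the following could confirm the extracted text no longer contains known boilerplate. The marker list is illustrative, mirroring the elements the cleaning function strips, not an exhaustive catalogue of Wikipedia chrome.

```python
# Illustrative markers of Wikipedia boilerplate that the cleaning step
# is expected to remove; extend as needed for other noise.
BOILERPLATE_MARKERS = [
    "Jump to navigation",
    "Jump to search",
    "v t e",      # navbox footer
    "[edit]",     # section edit links
]

def looks_clean(text: str) -> bool:
    """Return True if none of the known boilerplate markers appear."""
    return not any(marker in text for marker in BOILERPLATE_MARKERS)

print(looks_clean("Data science From Wikipedia, the free encyclopedia ..."))  # True
print(looks_clean("Jump to navigation Jump to search Data science ..."))      # False
```

A cell like this, run after extraction, would have caught the stale "Jump to navigation Jump to search" output flagged in the comments above.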
…lerplate Co-authored-by: leestott <2511341+leestott@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
The lesson 01 notebook extracted all HTML text including navigation menus, sidebars, and footers. This polluted keyword extraction and word clouds with irrelevant terms like "Navigation", "Retrieved on", "Privacy policy".
Changes
- Extract only the main article content from Wikipedia's `mw-parser-output` div
- Add a notebook cell installing `beautifulsoup4`

Before/After
Output Before (with boilerplate):
Output After (cleaned):
Impact
Testing
Original prompt