Example usage
This notebook shows how to use pybrokk in a project.
We start with a selection of a few top universities in Canada and walk through the four functions of the package:
| Function | Input | Output |
|---|---|---|
| create_id() | a list of URLs | a list of unique URL IDs |
| text_from_url() | a list of URLs | a dictionary of scraped raw text |
| duster() | a list of URLs | a dataframe combining the outputs of create_id() and text_from_url() |
| bow() | the output of duster() | a dataframe of bag-of-words counts appended to the input dataframe |
List of URLs
Here is the list of university URLs used in this example:
University of Toronto: https://www.utoronto.ca/
University of British Columbia: https://www.ubc.ca/
McGill University: https://www.mcgill.ca/
Queen’s University: https://www.queensu.ca/
Imports
from pybrokk.pybrokk import create_id, text_from_url, duster, bow
import requests
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
Example input
Based on the list of universities above, here is the sample input that several functions in this package expect:
urls = ['https://www.utoronto.ca/',
'https://www.ubc.ca/',
'https://www.mcgill.ca/',
'https://www.queensu.ca/']
create_id():
Create unique IDs for a list of URLs.
url_ids = create_id(urls)
url_ids
['utoronto1', 'ubc1', 'mcgill1', 'queensu1']
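The IDs above look like a site name taken from the URL followed by a counter. As a rough illustration of that idea only, the sketch below derives a label from each hostname and numbers repeated labels. The helper name sketch_ids and the handling of repeated sites are assumptions made for this example; pybrokk's own implementation may work differently.

```python
from collections import Counter
from urllib.parse import urlparse


def sketch_ids(urls):
    """Hypothetical helper, not part of pybrokk: build '<site><n>' style IDs."""
    seen = Counter()
    ids = []
    for url in urls:
        host = urlparse(url).hostname or ""
        # Drop a leading 'www.' and keep the first remaining label, e.g. 'ubc'.
        if host.startswith("www."):
            host = host[4:]
        label = host.split(".")[0]
        # Assumption: a repeated site gets the next counter value.
        seen[label] += 1
        ids.append(f"{label}{seen[label]}")
    return ids


print(sketch_ids(["https://www.utoronto.ca/", "https://www.ubc.ca/"]))
# ['utoronto1', 'ubc1']
```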
text_from_url():
Create a dictionary in which the keys are the URLs and the values are the raw text parsed by BeautifulSoup.
dictionary = text_from_url(urls)
The first item of this dictionary looks like this:
list(dictionary.items())[0]
('https://www.utoronto.ca/',
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUniversity of Toronto\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main content \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\n\n\n\n \nJump ToNews & Media\nAbout U of T\nGive To U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n\nSearch\n\n \n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\n \n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\n\n \n\n\n\n \nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n \nNews & Media\nAbout U of T\nGive to U of T\nAcademics\nResearch & Innovation\nUniversity Life\nLibraries\nPrograms of Study\nA to Z\n\n \n\n\n\n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n What can we help you with? \n\n\n\n\n\n\n\n \n\n\n \n\n\n\n \n\n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nU \n of T News\n \n\n\n\n\n\n\n\n\n U of T community members recognized with the Order of Canada \n\n\n\n\n [node:body] \n\n\n More \n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n \n\n\n\n\n\nCampus Status \n\n\n\n\n ok \n\n \n\n\n\n\n\n\n\n\n\n\xa0\n\n\n\n\n\nYour guide to the U of T community\n\nVisit UTogether\n\n\n\n\n\n\n\n\n\n\n\nLatest news\n\n\n\n\n\n\n\n\n\n \n \n January 31, 2023\n \n\nStudents dig into the past during 'Ancient Food Day'\n\n\n\n \n\n\n\n\n\n\n\n \n \n January 31, 2023\n \n\n‘Liquid windows’ inspired by squid skin could help buildings save energy\n\n\n\n\n \n \n January 31, 2023\n \n\nLink between coffee and kidney disease may depend on genetic variant, study finds\n\n\n\n \n\n\n\nMORE U OF T NEWS\n\n\n\n\n\n\nUpcoming events\n\n\n\n\n\n Upcoming Events - Home \n\n\n\n\n\nFebruary 2, 2023\nRe/Viewing, Re/Visioning, and Re/Imagining Black Canada Symposium\nFebruary 2, 2023\nBook Launch and Concert with Amir Issaa\nFebruary 3, 2023\nResearch Program Planning Workshop III \n \n\n\n\n\n\nMORE Events\n\n\n\n\n\n\n\n\n\n\nU of 
T Celebrates\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. U of T Celebrates recognizes their award-winning accomplishments.\n\nExplore U of T Celebrates\n\n\n\n\n\nRESEARCH & INNOVATION\n\nAngela Schoellig recognized with\nArthur B. McDonald Fellowship\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIn our latest issue: Meet the U of T alum pushing the boundaries of what computers can do; addressing Canada’s racial health gap; investigating crime, one fake body at a time. Plus: Ukrainian students find a haven at U of T; the future of urban farming and more\n\xa0\nExplore the issue\n\n\n\n\n\n\n\n\n\n\n\n\n\n Profile \nPeople\nThe Costs of Extraction\nKristen Bos investigates how pollution has affected – and continues to affect – Indigenous communities\n\n\n\n\n\n\n\n\n\n \n\n \n\n \n\n\n\n\n\n \n\n Future Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\n News & Media\nAbout U of T\nGive to U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n \n\n Contacts\nCareers\nAccessibility\nPrivacy\nSite Feedback\nSite Map\n St. George Campus\nMississauga Campus\nScarborough Campus\nCampus Maps\nCampus Safety\n \n\n\n ok \n\n \n\n \n\n\n\n \nStatement of Land Acknowledgement\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\xa0Read about U of T’s Statement of Land Acknowledgement.\n\n \n\n \n\n\n\n\n\n\n\n\n\nSOCIAL MEDIA DIRECTORY\n\n\nUNIVERSITY OF TORONTO - SINCE 1827\n\n\n\n \n\n\n\n \n\n\n\n\n\n\n")
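The raw text contains long runs of newlines and non-breaking spaces (\xa0) left over from the page layout. If cleaner text is needed downstream, one option is to collapse the whitespace with the standard library. This is a minimal sketch; the helper name collapse_whitespace is hypothetical and not part of pybrokk.

```python
import re


def collapse_whitespace(text):
    """Hypothetical helper: replace each run of whitespace with one space."""
    # For str patterns, \s matches Unicode whitespace, including \xa0.
    return re.sub(r"\s+", " ", text).strip()


sample = "\n\n\nUniversity of Toronto\n\n\nSkip to main content \n\n\xa0\n"
print(collapse_whitespace(sample))
# University of Toronto Skip to main content
```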
duster():
Create a dataframe from the outputs of create_id() and text_from_url().
df = duster(urls)
df
| id | url | raw_text |
|---|---|---|
| utoronto1 | https://www.utoronto.ca/ | University of TorontoSkip to main content ... |
| ubc1 | https://www.ubc.ca/ | The University of British ColumbiaSkip to main... |
| mcgill1 | https://www.mcgill.ca/ | McGill UniversityWINTER 2023 / HIVER 2023A saf... |
| queensu1 | https://www.queensu.ca/ | Home \| Queen's UniversitySkip to main content ... |
bow():
Create a dataframe of bag of words appended to the input dataframe.
df_bow = bow(df)
df_bow
| id | url | raw_text | 0g4get | 10 | 15 | 1827 | 18th | 19 | 1v7tel | 1z4tel | ... | year | years | you | youbalancing | young | younger | youon | your | youth | zsocial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| utoronto1 | https://www.utoronto.ca/ | University of TorontoSkip to main content ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ubc1 | https://www.ubc.ca/ | The University of British ColumbiaSkip to main... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| mcgill1 | https://www.mcgill.ca/ | McGill UniversityWINTER 2023 / HIVER 2023A saf... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| queensu1 | https://www.queensu.ca/ | Home \| Queen's UniversitySkip to main content ... | 0 | 1 | 3 | 0 | 0 | 2 | 0 | 0 | ... | 4 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
4 rows × 810 columns
The resulting df_bow is a tidy dataframe with one row per URL and one count column per token, which is a common starting point for machine learning projects.
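For readers new to the term, a bag of words simply counts how often each token occurs in each document. The sketch below illustrates the idea with the standard library only; it is not how bow() is implemented. The tokenization (lowercased tokens of two or more word characters) mirrors the defaults of scikit-learn's CountVectorizer, and the function name and sample texts are made up for illustration.

```python
import re
from collections import Counter


def sketch_bag_of_words(docs):
    """Hypothetical helper: return (sorted vocabulary, one count row per doc)."""
    # Lowercase each document and keep tokens of two or more word characters.
    counts = [Counter(re.findall(r"\b\w\w+\b", doc.lower())) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted(set().union(*counts))
    rows = [[c[token] for token in vocab] for c in counts]
    return vocab, rows


vocab, rows = sketch_bag_of_words(["Your guide to your campus", "Campus news"])
print(vocab)  # ['campus', 'guide', 'news', 'to', 'your']
print(rows)   # [[1, 1, 0, 1, 2], [1, 0, 1, 0, 0]]
```

Each row plays the role of one line of count columns in df_bow, with the vocabulary supplying the column names.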