Example usage

This notebook shows how to use pybrokk in a project.

In this example we start with a selection of a few top universities in Canada and run their home pages through the package's four functions:

Function         Input                   Output
create_id()      a list of URLs          a list of unique URL IDs
text_from_url()  a list of URLs          a dictionary of scraped raw text
duster()         a list of URLs          a dataframe in which the outputs of create_id() and text_from_url() are concatenated
bow()            the output of duster()  the input dataframe with bag-of-words columns appended

List of URLs

Here are the university URLs used in this example:

  • University of Toronto: https://www.utoronto.ca/

  • University of British Columbia: https://www.ubc.ca/

  • McGill University: https://www.mcgill.ca/

  • Queen’s University: https://www.queensu.ca/

Imports

from pybrokk.pybrokk import create_id, text_from_url, duster, bow
import requests
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer

Example input

Based on the universities listed above, this is the sample input that several functions in this package take:

urls = ['https://www.utoronto.ca/',
         'https://www.ubc.ca/',
         'https://www.mcgill.ca/',
         'https://www.queensu.ca/']

create_id():

Create unique IDs for a list of URLs.

url_ids = create_id(urls)
url_ids
['utoronto1', 'ubc1', 'mcgill1', 'queensu1']
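
To make the ID format above concrete, here is a minimal sketch of one way such IDs could be derived from a URL's host name. The helper make_ids is hypothetical and is not pybrokk's implementation; in particular, treating the trailing number as a per-name counter is an assumption based only on the output shown above.

```python
from collections import Counter
from urllib.parse import urlparse


def make_ids(urls):
    """Hypothetical stand-in for create_id(); not pybrokk's actual code."""
    seen = Counter()
    ids = []
    for url in urls:
        # 'https://www.utoronto.ca/' -> 'www.utoronto.ca'
        host = urlparse(url).netloc.lower()
        # Drop a leading 'www' label and keep the first remaining label.
        labels = [part for part in host.split(".") if part != "www"]
        name = labels[0] if labels else "url"
        # Assumption: the numeric suffix counts repeated names.
        seen[name] += 1
        ids.append(f"{name}{seen[name]}")
    return ids


example_urls = ['https://www.utoronto.ca/',
                'https://www.ubc.ca/',
                'https://www.mcgill.ca/',
                'https://www.queensu.ca/']
print(make_ids(example_urls))
```

Under these assumptions, a second URL on the same host (for example a hypothetical 'https://www.ubc.ca/about') would receive the ID 'ubc2'.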

text_from_url():

Create a dictionary in which the keys are the URLs and the values are the raw text parsed by BeautifulSoup.

dictionary = text_from_url(urls)

The first item of this dictionary looks like this:

list(dictionary.items())[0]
('https://www.utoronto.ca/',
 "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUniversity of Toronto\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main   content      \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\n\n\n\n \nJump ToNews & Media\nAbout U of T\nGive To U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n\nSearch\n\n \n\n\n\n\n\n\n\n\n\n\n\nEmail\nQuercus\nAcorn\n \n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\n\n \n\n\n\n \nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n \nNews & Media\nAbout U of T\nGive to U of T\nAcademics\nResearch & Innovation\nUniversity Life\nLibraries\nPrograms of Study\nA to Z\n\n \n\n\n\n\n\n\n\n \nFuture Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n            What can we help you with?          \n\n\n\n\n\n\n\n \n\n\n \n\n\n\n \n\n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nU \n                            of T News\n                          \n\n\n\n\n\n\n\n\n                          U of T community members recognized with the Order of Canada                        \n\n\n\n\n                        [node:body]                      \n\n\n                          More                        \n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n \n\n\n\n\n\nCampus Status \n\n\n\n\n ok  \n\n \n\n\n\n\n\n\n\n\n\n\xa0\n\n\n\n\n\nYour guide to the U of T community\n\nVisit UTogether\n\n\n\n\n\n\n\n\n\n\n\nLatest news\n\n\n\n\n\n\n\n\n\n  \n    \n      January 31, 2023\n      \n\nStudents dig into the past during 'Ancient Food Day'\n\n\n\n \n\n\n\n\n\n\n\n  \n    \n      January 31, 2023\n      \n\n‘Liquid windows’ inspired by squid skin could help buildings save energy\n\n\n\n\n  \n    \n      January 31, 2023\n      \n\nLink between coffee and kidney disease may depend on genetic variant, study finds\n\n\n\n \n\n\n\nMORE U OF T NEWS\n\n\n\n\n\n\nUpcoming events\n\n\n\n\n\n      Upcoming Events - 
Home    \n\n\n\n\n\nFebruary 2, 2023\nRe/Viewing, Re/Visioning, and Re/Imagining Black Canada Symposium\nFebruary 2, 2023\nBook Launch and Concert with Amir Issaa\nFebruary 3, 2023\nResearch Program Planning Workshop III \n \n\n\n\n\n\nMORE Events\n\n\n\n\n\n\n\n\n\n\nU of T Celebrates\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. U of T Celebrates recognizes their award-winning accomplishments.\n\nExplore U of T Celebrates\n\n\n\n\n\nRESEARCH & INNOVATION\n\nAngela Schoellig recognized with\nArthur B. McDonald Fellowship\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIn our latest issue: Meet the U of T alum pushing the boundaries of what computers can do; addressing Canada’s racial health gap; investigating crime, one fake body at a time. Plus: Ukrainian students find a haven at U of T; the future of urban farming and more\n\xa0\nExplore the issue\n\n\n\n\n\n\n\n\n\n\n\n\n\n                                    Profile                              \nPeople\nThe Costs of Extraction\nKristen Bos investigates how pollution has affected – and continues to affect – Indigenous communities\n\n\n\n\n\n\n\n\n\n \n\n \n\n \n\n\n\n\n\n \n\n Future Students\nCurrent Students\nAlumni\nFaculty & Staff\nDonors\nVisitors\n  News & Media\nAbout U of T\nGive to U of T\nAcademics\nPrograms of Study\nResearch & Innovation\nUniversity Life\nLibraries\nA-Z Directory\n  \n\n Contacts\nCareers\nAccessibility\nPrivacy\nSite Feedback\nSite Map\n  St. George Campus\nMississauga Campus\nScarborough Campus\nCampus Maps\nCampus Safety\n  \n\n\n ok  \n\n  \n\n \n\n\n\n \nStatement of Land Acknowledgement\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. 
Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\xa0Read about U of T’s Statement of Land Acknowledgement.\n\n \n\n \n\n\n\n\n\n\n\n\n\nSOCIAL MEDIA DIRECTORY\n\n\nUNIVERSITY OF TORONTO - SINCE 1827\n\n\n\n \n\n\n\n \n\n\n\n\n\n\n")
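
The raw text above is dominated by runs of newlines and non-breaking spaces (\xa0). The following sketch shows one possible way to tidy such a string before further processing. The helper collapse_whitespace is hypothetical and not part of pybrokk, and it is not necessarily the cleaning that duster() performs.

```python
import re


def collapse_whitespace(raw_text):
    """Hypothetical helper (not part of pybrokk): squeeze whitespace runs."""
    # Replace non-breaking spaces (\xa0) with ordinary spaces first.
    text = raw_text.replace("\xa0", " ")
    # Collapse every run of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()


sample = ("\n\n\nUniversity of Toronto\n\n\nSkip to main   content      "
          "\n\n\xa0\nEmail\nQuercus\n")
print(collapse_whitespace(sample))
# prints: University of Toronto Skip to main content Email Quercus
```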

duster():

Create a dataframe, indexed by ID, from the outputs of create_id() and text_from_url().

df = duster(urls)
df
           url                       raw_text
id
utoronto1  https://www.utoronto.ca/  University of TorontoSkip to main content ...
ubc1       https://www.ubc.ca/       The University of British ColumbiaSkip to main...
mcgill1    https://www.mcgill.ca/    McGill UniversityWINTER 2023 / HIVER 2023A saf...
queensu1   https://www.queensu.ca/   Home | Queen's UniversitySkip to main content ...

bow():

Create a dataframe of bag of words appended to the input dataframe.

df_bow = bow(df)
df_bow
           url                       raw_text                                           0g4get  10  15  1827  18th  19  1v7tel  1z4tel  ...  year  years  you  youbalancing  young  younger  youon  your  youth  zsocial
id
utoronto1  https://www.utoronto.ca/  University of TorontoSkip to main content ...      0       0   0   1     0     0   0       0       ...  0     1      1    0             0      0        0      1     0      0
ubc1       https://www.ubc.ca/       The University of British ColumbiaSkip to main...  0       0   0   0     0     1   1       1       ...  1     0      1    1             1      0        1      1     0      0
mcgill1    https://www.mcgill.ca/    McGill UniversityWINTER 2023 / HIVER 2023A saf...  1       0   0   0     1     0   0       0       ...  1     0      0    0             0      0        0      3     0      0
queensu1   https://www.queensu.ca/   Home | Queen's UniversitySkip to main content ...  0       1   3   0     0     2   0       0       ...  4     0      2    0             0      1        0      1     1      1

4 rows × 810 columns
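
To illustrate what the word-count columns above represent, here is a minimal, standard-library sketch of a bag-of-words count. The function bag_of_words is hypothetical and is not pybrokk's bow(). It imitates the default behaviour of scikit-learn's CountVectorizer (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); whether bow() uses CountVectorizer with those defaults is an assumption based on the imports and the column names shown above.

```python
import re
from collections import Counter

# Tokens of two or more word characters, the same shape as
# CountVectorizer's default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")


def bag_of_words(texts):
    """Hypothetical stand-in for a bag-of-words step; not pybrokk's bow()."""
    # Count the lowercased tokens of each document separately.
    counts = [Counter(TOKEN.findall(text.lower())) for text in texts]
    # The vocabulary is every token seen in any document, sorted alphabetically.
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[doc_counts[word] for word in vocab] for doc_counts in counts]
    return vocab, matrix


texts = ["Skip to main content", "Main campus news and campus maps"]
vocab, matrix = bag_of_words(texts)
print(vocab)
for row in matrix:
    print(row)
```

Each row of matrix corresponds to one document and each column to one word in vocab, which mirrors the layout of the count columns in df_bow.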

The resulting df_bow is a tidy dataframe that can serve as a starting point for machine learning projects.