pybrokk.duster

Module Contents

Functions

duster(urls)

Scrapes the raw text from a list of URLs and returns it as a pandas DataFrame that is ready to be used as input to a machine learning model.

pybrokk.duster.duster(urls)

Scrapes the raw text from a list of URLs and returns it as a pandas DataFrame that is ready to be used as input to a machine learning model.

Parameters:

urls (list) – List of target URLs, each given as a string.

Returns:

df – A DataFrame indexed by webpage identifier (id), with a url column holding the raw URL and a raw_text column holding the raw text of the webpage with extra line breaks removed.

Return type:

pandas.DataFrame

Examples

>>> from pybrokk.duster import duster
>>> duster(['https://www.cnn.com/world', 'https://www.foxnews.com/world', 'https://www.cbc.ca/news/world'])
                                    url                                           raw_text
id
cnn1          https://www.cnn.com/world   World news - breaking news, video, headlines ...
foxnews1  https://www.foxnews.com/world  World | Fox NewsFox News   U.S.PoliticsMediaOp...
cbc1      https://www.cbc.ca/news/world  World - CBC NewsContentSkip to Main ContentAcc...
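
The identifiers in the output above (cnn1, foxnews1, cbc1) look like a site name followed by a counter. The following is a minimal sketch of one way such identifiers and the line-break cleanup could be produced. It is not taken from the pybrokk source: the helper names make_ids and strip_line_breaks, the rule of dropping a leading "www." and keeping the first host label, and the per-site counter are all assumptions made for illustration. It uses only the standard library and does not fetch any pages.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def make_ids(urls):
    """Build identifiers such as 'cnn1' from URLs (hypothetical scheme)."""
    seen = Counter()
    ids = []
    for url in urls:
        # hostname is lowercased and excludes any port; None if no host is found.
        host = urlparse(url).hostname or ""
        # Assumption: drop a leading 'www.' and keep the first remaining label.
        if host.startswith("www."):
            host = host[4:]
        name = host.split(".")[0]
        # Assumption: the numeric suffix counts pages seen per site.
        seen[name] += 1
        ids.append(f"{name}{seen[name]}")
    return ids


def strip_line_breaks(text):
    """Replace each run of line breaks with a single space (hypothetical)."""
    return re.sub(r"[\r\n]+", " ", text).strip()


urls = [
    "https://www.cnn.com/world",
    "https://www.foxnews.com/world",
    "https://www.cbc.ca/news/world",
]

# Placeholder text stands in for scraped page content.
rows = [
    {"id": page_id, "url": url, "raw_text": strip_line_breaks("World\n\nnews\n")}
    for page_id, url in zip(make_ids(urls), urls)
]
```

Rows shaped like this can be turned into a frame with the same layout as the example output by calling pandas.DataFrame(rows).set_index("id").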