pybrokk.duster
Module Contents
Functions
duster(urls): Scrapes the raw text from a list of URLs and returns it as a pandas dataframe ready to be input into a machine learning model.
- pybrokk.duster.duster(urls)
Scrapes the raw text from each URL in a list and returns it as a pandas dataframe ready to be input into a machine learning model.
- Parameters:
urls (list) – list of target URLs, each given as a string
- Returns:
df – A dataframe with the webpage identifiers as the index, a column holding the raw URL, and a column holding the raw text from the webpage with extra line breaks removed.
- Return type:
pandas.DataFrame
Examples
>>> from pybrokk.duster import duster
>>> duster(['https://www.cnn.com/world', 'https://www.foxnews.com/world', 'https://www.cbc.ca/news/world'])
                                    url                                           raw_text
id
cnn1          https://www.cnn.com/world   World news - breaking news, video, headlines ...
foxnews1  https://www.foxnews.com/world  World | Fox NewsFox News U.S.PoliticsMediaOp...
cbc1      https://www.cbc.ca/news/world  World - CBC NewsContentSkip to Main ContentAcc...
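
The example output uses index values such as cnn1 and foxnews1, and the return description mentions that extra line breaks are removed, but this page does not say how either is done. The sketch below shows one way such identifiers and clean-up could be produced with the standard library only. Both `make_ids` and `collapse_line_breaks` are hypothetical helpers written for illustration; they are not part of pybrokk, and the real implementation may differ.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def make_ids(urls):
    """Hypothetical helper: derive identifiers such as 'cnn1' from URLs.

    Assumption: the identifier is the first host label after an optional
    'www.' prefix, followed by a per-site counter starting at 1.
    """
    seen = Counter()
    ids = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        name = host.split(".")[0]
        seen[name] += 1
        ids.append(f"{name}{seen[name]}")
    return ids


def collapse_line_breaks(text):
    """Hypothetical helper: replace each run of line breaks (and the
    whitespace around it) with a single space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


urls = [
    "https://www.cnn.com/world",
    "https://www.foxnews.com/world",
    "https://www.cbc.ca/news/world",
]
ids = make_ids(urls)
cleaned = collapse_line_breaks("World\n\n\nnews \n video")
```

Under these assumptions, a second URL from the same site would receive the next counter value (for example cnn2), which keeps the index unique when several pages from one site are scraped.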