Core functions
Handling date extraction
- htmldate.core.find_date(htmlobject, extensive_search=True, original_date=False, outputformat='%Y-%m-%d', url=None, verbose=False, min_date=None, max_date=None, deferred_url_extractor=False)
Extract dates from HTML documents using markup analysis and text patterns
- Parameters:
htmlobject (string or lxml tree) – Either (1) an HTML document, given as a text string (e.g. the body of an HTTP response or the contents of an .html file) or as an LXML parsed tree, or (2) a URL string, which is detected automatically
extensive_search (boolean) – Activate pattern-based opportunistic text search
original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)
outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())
url (string) – Provide a URL manually so that date patterns can be searched for in the URL itself (much faster in some cases)
verbose (boolean) – Enable verbose output for debugging
min_date (datetime, string) – Set the earliest acceptable date manually (ISO 8601 YMD format)
max_date (datetime, string) – Set the latest acceptable date manually (ISO 8601 YMD format)
deferred_url_extractor (boolean) – Use the URL extractor only as a fallback, so that full date expressions (e.g. of the type %Y-%m-%d %H:%M:%S) take priority
- Returns:
Returns a valid date expression as a string, or None
- Return type:
str | None
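A minimal usage sketch for the parameters listed above, assuming the htmldate package is installed. The HTML snippet, the bounds and the `publication_date` wrapper are made up for illustration, and the import is guarded so that the snippet still runs where the package is missing.

```python
from datetime import datetime

try:
    from htmldate.core import find_date
except ImportError:  # htmldate is not installed in this environment
    find_date = None

# Made-up document, for illustration only.
HTML_DOC = """<html><head>
<meta property="article:published_time" content="2017-09-01"/>
</head><body><p>Last updated on 2019-03-12.</p></body></html>"""

OUTPUT_FORMAT = "%Y/%m/%d"  # any valid strftime() format


def publication_date(html):
    """Hypothetical wrapper: return the publication date as a string, or None."""
    if find_date is None:
        return None
    return find_date(
        html,
        original_date=True,            # prefer publication over last-modified date
        outputformat=OUTPUT_FORMAT,
        min_date=datetime(2000, 1, 1),     # earliest acceptable date
        max_date=datetime(2030, 12, 31),   # latest acceptable date
    )


result = publication_date(HTML_DOC)
print(result)  # a string in OUTPUT_FORMAT, or None if nothing was found
```

Passing the bounds as `datetime` objects is a cautious choice here; according to the parameter list, ISO 8601 strings such as `"2000-01-01"` are accepted as well.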
- htmldate.core.examine_header(tree, outputformat, extensive_search, original_date, min_date, max_date)
Parse header elements to find date cues
- Parameters:
tree (HtmlElement) – LXML parsed tree object
outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())
extensive_search (boolean) – Activate pattern-based opportunistic text search
original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)
min_date (datetime) – Set the earliest acceptable date manually (ISO 8601 YMD format)
max_date (datetime) – Set the latest acceptable date manually (ISO 8601 YMD format)
- Returns:
Returns a valid date expression as a string, or None
- Return type:
str | None
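To illustrate what date cues in header elements can look like, the sketch below collects `<meta>` tags whose name or property hints at a date, using only the standard library. It is not htmldate's implementation: the `HeaderDateCues` class and the `DATE_HINTS` keyword list are hypothetical, and the documented function works on an LXML tree instead.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; the attributes htmldate actually checks may differ.
DATE_HINTS = ("date", "published", "modified", "updated")


class HeaderDateCues(HTMLParser):
    """Collect (attribute name, content) pairs from date-like <meta> tags."""

    def __init__(self):
        super().__init__()
        self.cues = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = (attributes.get("name") or attributes.get("property") or "").lower()
        content = attributes.get("content")
        # Keep the tag only if its name or property looks date-related.
        if content and any(hint in key for hint in DATE_HINTS):
            self.cues.append((key, content))


parser = HeaderDateCues()
parser.feed(
    '<html><head><meta charset="utf-8">'
    '<meta property="article:published_time" content="2017-09-01">'
    '<meta name="last-modified" content="2019-03-12">'
    "</head><body></body></html>"
)
print(parser.cues)
```

Each collected value would still have to be parsed and validated against the date bounds before it could be returned.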
- htmldate.core.search_page(htmlstring, outputformat, original_date, min_date, max_date)
Opportunistically search the HTML text for common text patterns
- Parameters:
htmlstring (string) – The HTML document in string format, potentially cleaned and stripped to the core (much faster)
outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())
original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)
min_date (datetime) – Set the earliest acceptable date manually (ISO 8601 YMD format)
max_date (datetime) – Set the latest acceptable date manually (ISO 8601 YMD format)
- Returns:
Returns a valid date expression as a string, or None
- Return type:
str | None
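The idea of an opportunistic pattern search can be sketched with a single regular expression. This is an illustration under stated assumptions, not the library's code: `search_text_for_date` and `YMD_PATTERN` are hypothetical names, only the YYYY-MM-DD pattern is covered, and the "original" date is approximated by the earliest plausible candidate.

```python
import re
from datetime import datetime

# Assumed pattern: only YYYY-MM-DD; the library covers many more formats.
YMD_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def search_text_for_date(text, outputformat, original_date, min_date, max_date):
    """Return the earliest (original_date=True) or latest plausible date, or None."""
    candidates = []
    for year, month, day in YMD_PATTERN.findall(text):
        try:
            candidate = datetime(int(year), int(month), int(day))
        except ValueError:  # impossible calendar date, e.g. month 99
            continue
        if min_date <= candidate <= max_date:
            candidates.append(candidate)
    if not candidates:
        return None
    chosen = min(candidates) if original_date else max(candidates)
    return chosen.strftime(outputformat)


text = "Posted 2016-07-12, updated 2018-01-05. Build 9999-99-99."
bounds = (datetime(1995, 1, 1), datetime(2030, 12, 31))
print(search_text_for_date(text, "%Y-%m-%d", True, *bounds))   # 2016-07-12
print(search_text_for_date(text, "%d/%m/%Y", False, *bounds))  # 05/01/2018
```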
Useful internal functions
- htmldate.extractors.try_date_expr(string, outputformat, extensive_search, min_date, max_date)
Use a series of heuristics and rules to parse a potential date expression
- htmldate.extractors.custom_parse(string, outputformat, min_date, max_date)
Try to bypass the slow dateparser
- htmldate.extractors.regex_parse(string)
Try full-text parse for date elements using a series of regular expressions with particular emphasis on English, French, German and Turkish
- htmldate.extractors.extract_url_date(testurl, outputformat, min_date, max_date)
Extract the date out of a URL string complying with the Y-M-D format
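As a rough illustration of date extraction from a URL in Y-M-D form, the sketch below uses one regular expression and the same bounds check as the other functions. The names `URL_DATE` and `url_date_sketch` are hypothetical, and the pattern is an assumption rather than the one htmldate uses.

```python
import re
from datetime import datetime

# Assumed pattern: YYYY/MM/DD or YYYY-MM-DD anywhere in the URL.
URL_DATE = re.compile(r"(?<!\d)([12]\d{3})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def url_date_sketch(url, outputformat, min_date, max_date):
    """Return the date embedded in the URL as a formatted string, or None."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    try:
        found = datetime(*(int(part) for part in match.groups()))
    except ValueError:  # digits matched, but not a real calendar date
        return None
    if not min_date <= found <= max_date:
        return None
    return found.strftime(outputformat)


earliest, latest = datetime(1995, 1, 1), datetime(2030, 12, 31)
print(url_date_sketch("https://example.org/2017/09/01/some-post/",
                      "%Y-%m-%d", earliest, latest))  # 2017-09-01
print(url_date_sketch("https://example.org/about",
                      "%Y-%m-%d", earliest, latest))  # None
```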
Helpers
- htmldate.extractors.convert_date(datestring, inputformat, outputformat)
Parse date and return string in desired format
- htmldate.extractors.date_validator(date_input, outputformat, earliest, latest)
Validate a string with respect to the chosen outputformat and basic heuristics
- htmldate.utils.load_html(htmlobject)
Load object given as input and validate its type (accepted: lxml.html tree, bytestring and string)
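The conversion and validation helpers can be approximated with the standard `datetime` module. The functions below are hypothetical stand-ins (`convert_date_sketch`, `is_plausible_date`) that show the general idea; the library's own helpers may apply further heuristics.

```python
from datetime import datetime


def convert_date_sketch(datestring, inputformat, outputformat):
    """Re-format a date string from one strftime format to another."""
    return datetime.strptime(datestring, inputformat).strftime(outputformat)


def is_plausible_date(datestring, outputformat, earliest, latest):
    """Check that the string parses with outputformat and lies within bounds."""
    try:
        parsed = datetime.strptime(datestring, outputformat)
    except ValueError:  # wrong format or impossible date
        return False
    return earliest <= parsed <= latest


earliest, latest = datetime(1995, 1, 1), datetime(2030, 12, 31)
print(convert_date_sketch("01.09.2017", "%d.%m.%Y", "%Y-%m-%d"))          # 2017-09-01
print(is_plausible_date("2017-09-01", "%Y-%m-%d", earliest, latest))      # True
print(is_plausible_date("1901-01-01", "%Y-%m-%d", earliest, latest))      # False
```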