Core functions

Handling date extraction

htmldate.core.find_date(htmlobject, extensive_search=True, original_date=False, outputformat='%Y-%m-%d', url=None, verbose=False, min_date=None, max_date=None, deferred_url_extractor=False)

Extract dates from HTML documents using markup analysis and text patterns

Parameters:

htmlobject (string or lxml tree) – Either (1) an HTML document (e.g. the body of an HTTP response or the contents of an .html file) as a text string or LXML parsed tree, or (2) a URL string, which is detected automatically
extensive_search (boolean) – Activate pattern-based opportunistic text search
original_date (boolean) – Look for the original date (e.g. publication date) instead of the most recent one (e.g. last modified, updated time)
outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())
url (string) – Provide a URL manually for pattern-searching in the URL (in some cases much faster)
verbose (boolean) – Set verbosity level for debugging
min_date (datetime, string) – Set the earliest acceptable date manually (ISO 8601 YMD format)
max_date (datetime, string) – Set the latest acceptable date manually (ISO 8601 YMD format)
deferred_url_extractor (boolean) – Use the URL extractor as a backup only, to prioritize full expressions, e.g. of the type %Y-%m-%d %H:%M:%S

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

htmldate.core.examine_header(tree, options)

Parse header elements to find date cues

Parameters:

tree (LXML tree) – LXML parsed tree object
options (Extractor) – Options for extraction

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

htmldate.core.search_page(htmlstring, options)

Opportunistically search the HTML text for common text patterns

Parameters:

htmlstring (string) – The HTML document in string format, potentially cleaned and stripped to the core (much faster)
options (Extractor) – Define extraction options

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

Useful internal functions

htmldate.extractors.try_date_expr(string, outputformat, extensive_search, min_date, max_date)

Use a series of heuristics and rules to parse a potential date expression

Parameters:

string (str | None)
outputformat (str)
extensive_search (bool)
min_date (datetime)
max_date (datetime)

Return type:

str | None

htmldate.extractors.custom_parse(string, outputformat, min_date, max_date)

Try to bypass the slow dateparser

Parameters:

string (str)
outputformat (str)
min_date (datetime)
max_date (datetime)

Return type:

str | None

htmldate.extractors.regex_parse(string)

Try a full-text parse for date elements using a series of regular expressions, with particular emphasis on English, French, German and Turkish

Parameters:

string (str)

Return type:

datetime | None

htmldate.extractors.extract_url_date(testurl, options)

Extract the date out of a URL string complying with the Y-M-D format

Parameters:

testurl (str | None)
options (Extractor)

Return type:

str | None
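
To illustrate what extracting a Y-M-D date from a URL can look like, here is a standard-library sketch. It is not the library's implementation: the regular expression, the function name url_date_sketch and the example URLs are assumptions made for this example, and it takes an output format directly rather than an Extractor object.

```python
import re
from datetime import datetime
from typing import Optional

# Illustrative pattern only: four-digit year, then month and day,
# separated by '/', '-' or '_' (e.g. /2016/12/23/ or 2016-12-23).
URL_DATE = re.compile(r"(?<!\d)([12]\d{3})[/_-](\d{1,2})[/_-](\d{1,2})(?!\d)")


def url_date_sketch(url: Optional[str], outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Return the first Y-M-D date found in the URL, or None."""
    if not url:
        return None
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # datetime() rejects impossible dates such as month 13 or day 32
        candidate = datetime(year, month, day)
    except ValueError:
        return None
    return candidate.strftime(outputformat)


print(url_date_sketch("https://example.org/2016/12/23/some-post.html"))
print(url_date_sketch("https://example.org/about"))
```

Building a datetime object before formatting is a cheap way to discard digit sequences that merely look like dates.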

htmldate.extractors.external_date_parser(string, outputformat)

Use dateutil parser or dateparser module according to system settings

Parameters:

string (str)
outputformat (str)

Return type:

str | None

Helpers

htmldate.validators.is_valid_date(date_input, outputformat, earliest, latest)

Validate a string w.r.t. the chosen outputformat and basic heuristics

Parameters:

date_input (datetime | str | None)
outputformat (str)
earliest (datetime)
latest (datetime)

Return type:

bool
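
The kind of check described here, a format match plus a range test, can be sketched with the standard library alone. The function name is_plausible_date and the bounds below are hypothetical; the library's own heuristics may differ.

```python
from datetime import datetime
from typing import Optional, Union


def is_plausible_date(
    date_input: Optional[Union[datetime, str]],
    outputformat: str,
    earliest: datetime,
    latest: datetime,
) -> bool:
    """Check that the input parses with outputformat and lies within the bounds."""
    if date_input is None:
        return False
    if isinstance(date_input, datetime):
        candidate = date_input
    else:
        try:
            candidate = datetime.strptime(date_input, outputformat)
        except ValueError:
            # the string does not match the expected format
            return False
    return earliest <= candidate <= latest


earliest = datetime(1995, 1, 1)
latest = datetime(2030, 12, 31)
print(is_plausible_date("2016-07-12", "%Y-%m-%d", earliest, latest))
print(is_plausible_date("1901-01-01", "%Y-%m-%d", earliest, latest))
print(is_plausible_date("12 July 2016", "%Y-%m-%d", earliest, latest))
```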

htmldate.validators.convert_date(datestring, inputformat, outputformat)

Parse date and return string in desired format

Parameters:

datestring (str)
inputformat (str)
outputformat (str)

Return type:

str
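
A conversion of this kind amounts to parsing with one format and re-serializing with another. The following standard-library sketch shows the idea; convert_date_sketch is a hypothetical name and not the library function itself.

```python
from datetime import datetime


def convert_date_sketch(datestring: str, inputformat: str, outputformat: str) -> str:
    """Parse datestring with inputformat and re-serialize it with outputformat."""
    parsed = datetime.strptime(datestring, inputformat)  # raises ValueError on mismatch
    return parsed.strftime(outputformat)


print(convert_date_sketch("2016-07-12", "%Y-%m-%d", "%d.%m.%Y"))
```

Numeric format codes are used here on purpose, since month and weekday names produced by strftime() depend on the active locale.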

htmldate.utils.load_html(htmlobject)

Load object given as input and validate its type (accepted: lxml.html tree, bytestring and string)

Parameters:

htmlobject (bytes | str | HtmlElement)

Return type:

HtmlElement | None

htmldate.utils.fetch_url(url)

Fetches page using urllib3 and decodes the response.

Parameters:

url (str) – URL of the page to fetch.

Returns:

HTML code as string, or Urllib3 response object (headers + body), or empty string in case the result is invalid, or None if there was a problem with the network.

Return type:

str | None
