Core functions

Handling date extraction

htmldate.core.find_date(htmlobject, extensive_search=True, original_date=False, outputformat='%Y-%m-%d', url=None, verbose=False, min_date=None, max_date=None, deferred_url_extractor=False)

Extract dates from HTML documents using markup analysis and text patterns

Parameters:
  • htmlobject (string or lxml tree) – One of two inputs: (1) an HTML document, e.g. the body of an HTTP response or the contents of an .html file, as a text string or parsed LXML tree; or (2) a URL string, which is detected automatically

  • extensive_search (boolean) – Activate pattern-based opportunistic text search

  • original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)

  • outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())

  • url (string) – Provide a URL manually for pattern-searching in the URL (in some cases much faster)

  • verbose (boolean) – Set verbosity level for debugging

  • min_date (datetime, string) – Set the earliest acceptable date manually (ISO 8601 YMD format)

  • max_date (datetime, string) – Set the latest acceptable date manually (ISO 8601 YMD format)

  • deferred_url_extractor (boolean) – Use the URL extractor only as a backup, so that full expressions (e.g. of the type %Y-%m-%d %H:%M:%S) take priority

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

htmldate.core.examine_header(tree, options)

Parse header elements to find date cues

Parameters:
  • tree (LXML tree) – LXML parsed tree object

  • options (Extractor) – Options for extraction

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

htmldate.core.search_page(htmlstring, options)

Opportunistically search the HTML text for common text patterns

Parameters:
  • htmlstring (string) – The HTML document in string format, potentially cleaned and stripped to the core (much faster)

  • options (Extractor) – Define extraction options

Returns:

Returns a valid date expression as a string, or None

Return type:

str | None

Useful internal functions

htmldate.extractors.try_date_expr(string, outputformat, extensive_search, min_date, max_date)

Use a series of heuristics and rules to parse a potential date expression

Return type:

str | None

htmldate.extractors.custom_parse(string, outputformat, min_date, max_date)

Try to parse the date expression directly, bypassing the slow dateparser

Return type:

str | None
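
The general idea of trying cheap, fixed formats before falling back to a heavier parser can be sketched with the standard library alone. This is not htmldate's implementation: the function name fast_parse_sketch and the FAST_FORMATS tuple are hypothetical, and the bounds are assumed to be datetime objects.

```python
from datetime import datetime

# Hypothetical shortlist of common numeric formats, tried in order
FAST_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%d.%m.%Y", "%Y/%m/%d")


def fast_parse_sketch(string, outputformat, min_date, max_date):
    """Return a formatted date if a fixed format matches and the date is in bounds."""
    text = string.strip()
    for fmt in FAST_FORMATS:
        try:
            candidate = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        if min_date <= candidate <= max_date:
            return candidate.strftime(outputformat)
        return None  # parsed, but outside the accepted range
    return None  # nothing matched; a slower parser could take over here


earliest, latest = datetime(1995, 1, 1), datetime(2030, 1, 1)
print(fast_parse_sketch("12.07.2016", "%Y-%m-%d", earliest, latest))
```

Returning None both for unparseable input and for out-of-range dates mirrors the str | None return type documented above.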

htmldate.extractors.regex_parse(string)

Attempt a full-text parse for date elements using a series of regular expressions, with particular emphasis on English, French, German and Turkish

Parameters:

string (str)

Return type:

datetime | None
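
To make the approach concrete, here is a much reduced, English-only sketch of regex-based date parsing that returns a datetime or None. The pattern, the MONTHS table and the name regex_parse_sketch are assumptions made for illustration and do not reproduce the expressions used by the library.

```python
import re
from datetime import datetime

# English month names mapped to their numbers (illustrative, English only)
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Matches expressions such as "July 12th, 2016" or "March 3 2020"
LONG_DATE = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b")


def regex_parse_sketch(text):
    """Return the first plausible date found in the text as a datetime, else None."""
    for match in LONG_DATE.finditer(text):
        month = MONTHS.get(match.group(1).lower())
        if month is None:
            continue  # the word before the day is not a month name
        try:
            return datetime(int(match.group(3)), month, int(match.group(2)))
        except ValueError:
            continue  # e.g. "February 31st, 2016"
    return None


print(regex_parse_sketch("Posted on July 12th, 2016 by admin"))
```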

htmldate.extractors.extract_url_date(testurl, options)

Extract the date from a URL string that contains a date in Y-M-D order

Parameters:
  • testurl (str | None)

  • options (Extractor)

Return type:

str | None
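
The principle of URL-based extraction can be illustrated with a small standard-library sketch. The regular expression and the function url_date_sketch are hypothetical and simplified; unlike the documented function, the sketch takes an output format directly instead of an Extractor options object.

```python
import re
from datetime import datetime

# Year, month and day separated by "/", "-" or "_", not embedded in longer numbers
URL_DATE = re.compile(r"(?<!\d)([12]\d{3})[/_-](\d{1,2})[/_-](\d{1,2})(?!\d)")


def url_date_sketch(url, outputformat="%Y-%m-%d"):
    """Return the first valid Y-M-D date found in the URL, or None."""
    if not url:
        return None
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        candidate = datetime(year, month, day)
    except ValueError:
        return None  # digits matched the pattern but do not form a real date
    return candidate.strftime(outputformat)


print(url_date_sketch("https://example.org/2016/12/23/post.html"))
```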

htmldate.extractors.external_date_parser(string, outputformat)

Use dateutil parser or dateparser module according to system settings

Parameters:
  • string (str)

  • outputformat (str)

Return type:

str | None

Helpers

htmldate.extractors.convert_date(datestring, inputformat, outputformat)

Parse date and return string in desired format

Parameters:
  • datestring (str)

  • inputformat (str)

  • outputformat (str)

Return type:

str
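
Conceptually, this kind of conversion is a strptime/strftime round trip. The following standard-library sketch shows the idea; convert_date_sketch is a hypothetical stand-in and may differ from the library function in its details.

```python
from datetime import datetime


def convert_date_sketch(datestring, inputformat, outputformat):
    """Parse datestring with inputformat and re-serialize it with outputformat."""
    parsed = datetime.strptime(datestring, inputformat)
    return parsed.strftime(outputformat)


print(convert_date_sketch("2016-07-12", "%Y-%m-%d", "%d/%m/%Y"))
```

Note that datetime.strptime raises ValueError when the string does not match the input format, so callers of such a helper need to handle that case.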

htmldate.utils.load_html(htmlobject)

Load object given as input and validate its type (accepted: lxml.html tree, bytestring and string)

Parameters:

htmlobject (bytes | str | HtmlElement)

Return type:

HtmlElement | None

htmldate.utils.fetch_url(url)

Fetches page using urllib3 and decodes the response.

Parameters:

url (str) – URL of the page to fetch.

Returns:

HTML code as string, or Urllib3 response object (headers + body), or empty string in case the result is invalid, or None if there was a problem with the network.

Return type:

str | None