Core functions

Handling date extraction

htmldate.core.find_date(htmlobject, extensive_search=True, original_date=False, outputformat='%Y-%m-%d', url=None, verbose=False, min_date=None, max_date=None)

Extract dates from HTML documents using markup analysis and text patterns

Parameters
  • htmlobject (string or lxml tree) – Either an HTML document (e.g. the body of an HTTP response or an .html file) as a text string or LXML parsed tree, or a URL string (detected automatically)

  • extensive_search (boolean) – Activate pattern-based opportunistic text search

  • original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)

  • outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())

  • url (string) – Provide a URL manually for pattern-searching in the URL (in some cases much faster)

  • verbose (boolean) – Set verbosity level for debugging

  • min_date (string) – Set the earliest acceptable date manually (YYYY-MM-DD format)

  • max_date (string) – Set the latest acceptable date manually (YYYY-MM-DD format)

Returns

Returns a valid date expression as a string, or None
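
A minimal usage sketch of the signature above, assuming the htmldate package is installed; the import is guarded so the snippet returns None otherwise. The sample HTML and the wrapper name publication_date are illustrative and not part of the library.

```python
from typing import Optional

# Illustrative document; the meta tag is an assumption about typical markup.
SAMPLE_HTML = (
    "<html><head>"
    '<meta property="article:published_time" content="2016-07-12"/>'
    "</head><body><p>Some text.</p></body></html>"
)


def publication_date(html: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Hypothetical wrapper: return the original date found in `html`, or None."""
    try:
        from htmldate import find_date  # third-party dependency
    except ImportError:
        return None  # htmldate is not available in this environment
    # original_date=True asks for the publication date rather than a later update.
    return find_date(html, original_date=True, outputformat=outputformat)


if __name__ == "__main__":
    # Same document, two output formats (any valid strftime pattern).
    print(publication_date(SAMPLE_HTML))
    print(publication_date(SAMPLE_HTML, outputformat="%d %B %Y"))
```

Since find_date returns either a string or None, callers should check for None before using the result.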

htmldate.core.examine_header(tree, outputformat, extensive_search, original_date, min_date, max_date)

Parse header elements to find date cues

Parameters
  • tree – LXML parsed tree object

  • outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())

  • extensive_search (boolean) – Activate pattern-based opportunistic text search

  • original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)

  • min_date (string) – Set the earliest acceptable date manually (YYYY-MM-DD format)

  • max_date (string) – Set the latest acceptable date manually (YYYY-MM-DD format)

Returns

Returns a valid date expression as a string, or None

htmldate.core.search_page(htmlstring, outputformat, original_date, min_date, max_date)

Opportunistically search the HTML text for common text patterns

Parameters
  • htmlstring (string) – The HTML document in string format, potentially cleaned and stripped to the core (much faster)

  • outputformat (string) – Provide a valid datetime format for the returned string (see datetime.strftime())

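  • original_date (boolean) – Look for original date (e.g. publication date) instead of most recent one (e.g. last modified, updated time)

  • min_date (string) – Set the earliest acceptable date manually (YYYY-MM-DD format)

  • max_date (string) – Set the latest acceptable date manually (YYYY-MM-DD format)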

Returns

Returns a valid date expression as a string, or None

Useful internal functions

htmldate.extractors.try_ymd_date(string, outputformat, extensive_search, min_date, max_date)

Use a series of heuristics and rules to parse a potential date expression

htmldate.extractors.custom_parse(string, outputformat, extensive_search, min_date, max_date)

Try to bypass the slow dateparser

htmldate.extractors.regex_parse(string)

Full-text parse using a series of regular expressions

htmldate.extractors.extract_url_date(testurl, outputformat)

Extract the date from a URL string that follows the Y-M-D format
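
The idea of reading a Y-M-D date out of a URL can be sketched with the standard library alone. This is not the library's implementation: the regular expression, the function name url_date_sketch and the default output format are assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed pattern: four-digit year, then month and day, separated by "/" or "-".
YMD_URL = re.compile(
    r"(?<![0-9])([0-9]{4})[/-]([0-9]{1,2})[/-]([0-9]{1,2})(?![0-9])"
)


def url_date_sketch(url: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Return the first plausible Y-M-D date found in `url`, or None."""
    match = YMD_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        candidate = datetime(year, month, day)  # rejects e.g. month 13
    except ValueError:
        return None
    return candidate.strftime(outputformat)


if __name__ == "__main__":
    print(url_date_sketch("https://example.org/2016/07/12/post.html"))
```

Building a datetime from the captured groups filters out digit runs that merely look like dates, such as product or article identifiers.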

htmldate.extractors.extract_partial_url_date(testurl, outputformat)

Extract an approximate date from a URL string in Y-M format
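
When a URL carries only a year and a month, an approximate date can still be derived. The sketch below assumes that the missing day is set to the first of the month; that choice, the pattern and the name partial_url_date_sketch are illustrative assumptions rather than documented behavior.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed pattern: "/YYYY/MM" followed by another slash or the end of the URL.
YM_URL = re.compile(r"/([0-9]{4})/([0-9]{1,2})(?:/|$)")


def partial_url_date_sketch(url: str, outputformat: str = "%Y-%m-%d") -> Optional[str]:
    """Return an approximate date (day assumed to be 1) from a Y-M URL, or None."""
    match = YM_URL.search(url)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    try:
        candidate = datetime(year, month, 1)  # the day is unknown, so use the 1st
    except ValueError:
        return None
    return candidate.strftime(outputformat)


if __name__ == "__main__":
    print(partial_url_date_sketch("https://example.org/2016/07/some-post"))
```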

htmldate.extractors.external_date_parser(string, outputformat)

Use dateutil parser or dateparser module according to system settings

Helpers

htmldate.extractors.convert_date(datestring, inputformat, outputformat)

Parse date and return string in desired format
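
Converting between two date formats is a strptime/strftime round trip. The following is a standard-library sketch of that idea under an assumed name (convert_date_sketch), not the library's own code.

```python
from datetime import datetime


def convert_date_sketch(datestring: str, inputformat: str, outputformat: str) -> str:
    """Parse `datestring` with `inputformat` and re-render it with `outputformat`."""
    parsed = datetime.strptime(datestring, inputformat)  # raises ValueError on mismatch
    return parsed.strftime(outputformat)


if __name__ == "__main__":
    print(convert_date_sketch("12.07.2016", "%d.%m.%Y", "%Y-%m-%d"))
```

The sketch lets ValueError propagate when the input does not match inputformat; a caller may prefer to catch it and return None.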

htmldate.extractors.date_validator(date_input, outputformat, earliest=datetime.date(1995, 1, 1), latest=datetime.date(2021, 12, 3))

Validate a string w.r.t. the chosen outputformat and basic heuristics
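
A validation step of this kind can be approximated by parsing the string with the chosen format and checking it against a date window. In this sketch the function name date_validator_sketch, the boolean return value and the use of today's date as the default upper bound are assumptions; only the 1995-01-01 lower bound is taken from the signature above.

```python
from datetime import date, datetime
from typing import Optional


def date_validator_sketch(
    date_input: str,
    outputformat: str,
    earliest: date = date(1995, 1, 1),
    latest: Optional[date] = None,
) -> bool:
    """Return True if `date_input` matches `outputformat` and lies in the window."""
    if latest is None:
        latest = date.today()  # assumed default: no dates in the future
    try:
        parsed = datetime.strptime(date_input, outputformat).date()
    except (ValueError, TypeError):
        return False  # not a string, or not in the expected format
    return earliest <= parsed <= latest


if __name__ == "__main__":
    print(date_validator_sketch("2016-07-12", "%Y-%m-%d"))
```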

htmldate.utils.load_html(htmlobject)

Load the object given as input and validate its type. Accepted types: LXML tree, bytestring and string (HTML document or URL). Raises ValueError if a URL is passed but yields no result.

htmldate.utils.fetch_url(url)

Fetch a page using urllib3 and decode the response.

Parameters

url – URL of the page to fetch.

Returns

HTML code as a string, or a urllib3 response object (headers + body), or an empty string if the result is invalid, or None if there was a network problem.