Input format

The module expects strings as shown above. It can also take already parsed HTML (i.e. an LXML tree object):

>>> from htmldate import find_date
>>> from lxml import html
>>> mytree = html.fromstring('<html><body><span class="entry-date">July 12th, 2016</span></body></html>')
>>> find_date(mytree)

An external module can be used for the download, as described in versions prior to 0.3. This example uses the legacy mode with requests as the external module.

>>> from htmldate.core import find_date
# using requests
>>> import requests
>>> r = requests.get('https://creativecommons.org/about/')
>>> find_date(r.text)
'2017-11-28' # may have changed since
# using htmldate's own fetch_url function
>>> from htmldate.utils import fetch_url
>>> htmldoc = fetch_url('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/')
>>> find_date(htmldoc)
# or simply
>>> find_date('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') # URL detected
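How the module tells a URL from an HTML string is not spelled out here. The following is only a minimal sketch of one plausible check, assuming a URL is a single whitespace-free token with an http(s) scheme. The names URL_PATTERN and looks_like_url are hypothetical and not part of htmldate.

```python
import re

# Hypothetical pattern: an http(s) scheme followed by non-space characters only.
URL_PATTERN = re.compile(r'https?://\S+$')


def looks_like_url(text):
    """Guess whether the input is a URL rather than an HTML document (hypothetical helper)."""
    # Non-string inputs (e.g. an already parsed tree) are never treated as URLs.
    return isinstance(text, str) and URL_PATTERN.match(text.strip()) is not None
```

With a check of this kind, an HTML document passed as a string falls through to parsing, while a bare address triggers a download first.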

Date format

The output format of the dates found can be set with a format string understood by Python’s datetime module. The default is %Y-%m-%d:

>>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
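To make the format codes concrete, here is a short sketch using only the standard library. It builds the date from the example above by hand and renders it with both format strings. The variable names are illustrative, and the month name assumes Python's default (English) locale.

```python
from datetime import date

# The date reported in the example above, built by hand for illustration.
sample = date(2016, 11, 18)

# Default output format: four-digit year, month, day.
default_output = sample.strftime('%Y-%m-%d')

# Custom output format: day, full month name, year.
custom_output = sample.strftime('%d %B %Y')
```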

The output format given in the settings is validated before use.
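A check of this kind could look roughly as follows. This is a sketch under stated assumptions, not the module's actual code: the function name is_valid_output_format is hypothetical, and the test simply requires a string with at least one % directive that strftime accepts.

```python
from datetime import datetime


def is_valid_output_format(outputformat):
    """Return True if the value looks like a usable strftime format (hypothetical helper)."""
    # A usable format must be a string containing at least one % directive.
    if not isinstance(outputformat, str) or '%' not in outputformat:
        return False
    # Try the format on a fixed date and reject it if strftime raises.
    try:
        datetime(2017, 9, 1).strftime(outputformat)
    except (TypeError, ValueError):
        return False
    return True
```

Note that some platforms render unknown directives literally instead of raising, so a check like this catches malformed input rather than every questionable format.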

Original date

Although the time delta between the original publication and the “last modified” statement is usually a matter of hours or days at most, it can be useful in some contexts to prioritize the original publication date during extraction:

>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/') # default setting
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
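The effect of the switch can be pictured with a simplified sketch. It assumes that several candidate dates have already been found in a page and that the choice is simply between the earliest and the latest one. The helper pick_date and the dates are made up for illustration, and the module's real heuristics may well be more involved.

```python
from datetime import date


def pick_date(candidates, original_date=False):
    """Pick one date from the candidates (hypothetical helper, not htmldate's logic)."""
    if not candidates:
        return None
    # The earliest candidate approximates the original publication,
    # the latest candidate approximates the last modification.
    return min(candidates) if original_date else max(candidates)


# Made-up candidate dates for a single page.
found = [date(2016, 6, 23), date(2016, 6, 27)]
```

Here pick_date(found) favors the later date, while pick_date(found, original_date=True) favors the earlier one.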


See the settings.py file, which lists a series of settings that are applied module-wide.

The module can then be re-compiled locally to apply changes to the settings.