Options
=======
Configuration
-------------
Input format
~~~~~~~~~~~~
The module expects strings as shown above. It is also possible to pass already parsed HTML (i.e. an LXML tree object):
.. code-block:: python

    >>> from htmldate import find_date
    >>> from lxml import html
    >>> mytree = html.fromstring('<html><body><span class="entry-date">July 12th, 2016</span></body></html>')
    >>> find_date(mytree)
    '2016-07-12'

An external module can also handle the download, as described for versions prior to 0.3. This example uses that legacy mode with ``requests`` as the external module.
.. code-block:: python

    >>> from htmldate.core import find_date
    # using requests
    >>> import requests
    >>> r = requests.get('https://creativecommons.org/about/')
    >>> find_date(r.text)
    '2017-11-28' # may have changed since
    # using htmldate's own fetch_url function
    >>> from htmldate.utils import fetch_url
    >>> htmldoc = fetch_url('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/')
    >>> find_date(htmldoc)
    '2018-06-28'
    # or simply
    >>> find_date('https://blog.wikimedia.org/2018/06/28/interactive-maps-now-in-your-language/') # URL detected
    '2018-06-28'

Date format
~~~~~~~~~~~
Use the ``outputformat`` argument to change the output to any format known to Python's ``datetime`` module. The default is ``%Y-%m-%d``:
.. code-block:: python

    >>> find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
    '18 November 2016' # may have changed since
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html', outputformat='%Y-%m-%dT%H:%M:%S%z')
    '2016-12-23T05:11:00-0500'

.. autofunction:: htmldate.validators.output_format_validator
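
A format string can be tried out against a fixed date before it is passed as ``outputformat``. The following is a minimal sketch using only the standard library, not the library's own validator: the helper name ``preview_format`` and the sample date are hypothetical, and the sketch only assumes that the format is ultimately applied through ``datetime.strftime``.

.. code-block:: python

    from datetime import datetime

    # Arbitrary sample date used only for previewing formats (assumption)
    SAMPLE = datetime(2016, 11, 18, 5, 11, 0)

    def preview_format(outputformat):
        """Return SAMPLE rendered with outputformat, or None if unusable."""
        # A string without any % directive would yield a constant output
        if "%" not in outputformat:
            return None
        try:
            return SAMPLE.strftime(outputformat)
        except (TypeError, ValueError):
            return None

    print(preview_format("%Y-%m-%d"))           # 2016-11-18
    print(preview_format("%Y-%m-%dT%H:%M:%S"))  # 2016-11-18T05:11:00
    print(preview_format("no directives"))      # None

Month and weekday names produced by directives such as ``%B`` depend on the active locale, so they are best checked in the environment where the code will run.
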
Original date
~~~~~~~~~~~~~
Although the time delta between the original publication and the "last modified" statement is usually a matter of hours or days at most, it can be useful in some contexts to prioritize the original publication date during extraction:
.. code-block:: python

    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/') # default setting
    '2019-06-24'
    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True) # modified behavior
    '2016-06-23'

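
The effect of ``original_date`` can be pictured as a choice among the dates found in a page. The sketch below is a deliberate simplification and not htmldate's actual extraction logic: the function ``pick_date`` and its list of candidates are hypothetical, and it assumes that the default setting favors the most recent date while ``original_date=True`` favors the earliest one.

.. code-block:: python

    from datetime import date

    def pick_date(candidates, original_date=False):
        """Pick the earliest date if original_date is set, else the latest."""
        if not candidates:
            return None
        chosen = min(candidates) if original_date else max(candidates)
        return chosen.strftime("%Y-%m-%d")

    # Hypothetical candidates: a publication date and a later update
    found = [date(2016, 6, 23), date(2019, 6, 24)]
    print(pick_date(found))                      # 2019-06-24
    print(pick_date(found, original_date=True))  # 2016-06-23
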
Settings
--------
See the ``settings.py`` file:
.. automodule:: htmldate.settings
   :members:
   :show-inheritance:
   :undoc-members:

The module can then be re-compiled locally to apply changes to the settings.
Clearing caches
~~~~~~~~~~~~~~~
.. code-block:: python

    >>> from htmldate.meta import reset_caches
    # at a given point in time
    >>> reset_caches()

*New in version 1.3.0.*
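
Clearing caches matters mostly in long-running processes, where memoized results accumulate in memory. The sketch below illustrates the general mechanism with the standard library's ``functools.lru_cache``. The function ``parse_year`` is hypothetical, and the example does not reproduce what ``reset_caches`` does internally.

.. code-block:: python

    from functools import lru_cache

    @lru_cache(maxsize=1024)
    def parse_year(text):
        """Hypothetical memoized helper: read a four-digit year prefix."""
        return int(text[:4])

    parse_year("2018-06-28")
    parse_year("2018-06-28")  # second call is served from the cache
    print(parse_year.cache_info().hits)      # 1

    parse_year.cache_clear()  # drop all memoized results and statistics
    print(parse_year.cache_info().currsize)  # 0
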
Tests
-----
A series of HTML pages exercising different structural and content patterns is included for testing purposes:
.. code-block:: bash

    $ python3 -m pip install pytest
    $ pytest