One thing I love about Python is how quickly it lets you prototype or try things out, thanks to its interactive interpreter, also called the REPL.[1] In this article, I'll show you how I use it to run quick tests and verify assumptions.
The other day I wanted to write a little helper script to retrieve data from my Mastodon account. Specifically, I wanted to retrieve my recent posts and parse them to extract a list of all the URLs they contain.
Mastodon provides an API that returns JSON data. I had a look at the API documentation, and got to work.
So, let's say I would like to retrieve my latest posts on Mastodon. The API documentation mentions the `statuses` endpoint, which requires the user `id`. To find the user `id`, I can use the `lookup` endpoint.
Now, the other thing I love about Python is that it comes with "batteries included". Most of the time, for quick stuff, I don't need extra libraries.
To retrieve something from the Web, I can use the `urllib.request` module, and to work with JSON data, I can use the `json` module:
In [1]: import urllib.request
In [2]: import json
In [3]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/lookup?acct=pieq") as response:
...: account_data = json.load(response)
...:
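(For a quick interactive session like this I skip error handling, but in a real script I'd wrap that call. A minimal sketch, using only the standard `urllib.error` exceptions:

import json
import urllib.error
import urllib.request

url = "https://floss.social/api/v1/accounts/lookup?acct=pieq"
try:
    with urllib.request.urlopen(url) as response:
        account_data = json.load(response)
except urllib.error.HTTPError as err:
    # The server replied with an error status (e.g. 404 for an unknown account)
    print(f"HTTP error {err.code}: {err.reason}")
except urllib.error.URLError as err:
    # The request never reached the server (DNS failure, no network, ...)
    print(f"Network error: {err.reason}")

But let's not worry about that here.)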
Either way, et voilà! My account information has been stored in the `account_data` dictionary. I can use the pretty printer module[3] to display it nicely:
In [4]: from pprint import pprint
In [5]: pprint(account_data)
{'acct': 'pieq',
(…)
'bot': False,
'created_at': '2022-11-10T00:00:00.000Z',
(…)
'id': '109318008294543295',
(…)
'username': 'pieq'}
And so, I can retrieve my account `id` like this:
In [6]: account_data["id"]
Out[6]: '109318008294543295'
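Rather than copy-pasting that value around, I could also build the next URL directly from it with an f-string (`user_id` and `statuses_url` are just names I'm picking here):

user_id = account_data["id"]
statuses_url = f"https://floss.social/api/v1/accounts/{user_id}/statuses?exclude_replies=1"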
Now that I know my user `id`, I can retrieve the latest posts from Mastodon:
In [7]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/109318008294543295/statuses?exclude_replies=1") as response:
...: statuses = json.load(response)
...:
In [8]: len(statuses)
Out[8]: 20
I see that the latest 20 statuses have been retrieved, as explained in the API documentation (`len()` returns the length of an object).
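If I wanted a different page size, the documentation also describes a `limit` query parameter. While I'm at it, the standard library can build the query string for me; a small sketch, assuming `limit` behaves as documented (`more_statuses` is a name I'm introducing here):

from urllib.parse import urlencode

# Build the query string instead of concatenating it by hand
params = urlencode({"exclude_replies": 1, "limit": 40})
url = f"https://floss.social/api/v1/accounts/109318008294543295/statuses?{params}"
with urllib.request.urlopen(url) as response:
    more_statuses = json.load(response)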
I can check the content of the first status in the list like this:
In [9]: statuses[0]
Out[9]:
{'id': '110016084347354747',
'created_at': '2023-03-13T12:57:04.541Z',
(…)
'content': '<p>J\'écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C\'est passionnant ! Je n\'avais jamais fait le lien entre Pepper & Carrot et Sintel ! :blender:</p>',
(…)}
(Note that I could get the last status in the list by using a negative index: `statuses[-1]`. Pretty cool!)
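The output above is truncated; to see every field a status carries, I can simply list the dictionary's keys:

# Iterating over a dict yields its keys, so this lists them alphabetically
sorted(statuses[0])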
Now, what I wanted to do was to parse the content of each status and extract the links that are present.
I can extract the content from the latest status like this:
In [10]: statuses[0]["content"]
Out[10]: '<p>J\'écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C\'est passionnant ! Je n\'avais jamais fait le lien entre Pepper & Carrot et Sintel ! :blender:</p>'
As you can see, the content is a fragment of HTML. Parsing HTML can be quite a challenge, but fortunately, the BeautifulSoup library can help with that. I can install it[2] and get to work. I want to:
- parse the HTML fragment,
- find all the links,
- extract their URLs.
In [11]: from bs4 import BeautifulSoup
In [12]: status_html = BeautifulSoup(statuses[0]["content"], "html.parser")
In [13]: links = status_html.find_all("a")
In [14]: links
Out[14]:
[<a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a>,
<a class="u-url mention" href="https://framapiaf.org/@davidrevoy">@<span>davidrevoy</span></a>]
In [15]: [link.get("href") for link in links]
Out[15]:
['https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt',
'https://framapiaf.org/@davidrevoy']
That last line uses a feature of the Python language called a list comprehension, and it's really useful for massaging or filtering data from one list into another.
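A list comprehension can also filter while it transforms. For example, to keep only the external links and drop the @-mentions, I could filter on the "mention" CSS class that appears on those anchors in the markup above (a sketch based on that output; `external_links` is a name I'm picking here):

# Keep only links whose <a> tag does not carry the "mention" class;
# BeautifulSoup returns multi-valued attributes like class as a list
external_links = [
    link.get("href")
    for link in links
    if "mention" not in link.get("class", [])
]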
And here it is: in a few lines of code in the Python interactive interpreter, I was able to get what I wanted. I can now turn that into functions and save them in a script![4]
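Here is one way that script could look, a minimal sketch assembled from the snippets above (the function names are mine, and error handling is left out for brevity):

import json
import urllib.request

from bs4 import BeautifulSoup

API_BASE = "https://floss.social/api/v1"

def lookup_account_id(acct):
    """Return the account id for a given username."""
    with urllib.request.urlopen(f"{API_BASE}/accounts/lookup?acct={acct}") as response:
        return json.load(response)["id"]

def fetch_statuses(account_id):
    """Return the latest statuses of an account, excluding replies."""
    url = f"{API_BASE}/accounts/{account_id}/statuses?exclude_replies=1"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def extract_links(status):
    """Return all URLs found in the HTML content of a status."""
    soup = BeautifulSoup(status["content"], "html.parser")
    return [link.get("href") for link in soup.find_all("a")]

if __name__ == "__main__":
    for status in fetch_statuses(lookup_account_id("pieq")):
        for url in extract_links(status):
            print(url)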
One last thing. When you don't know what functions, methods or attributes are available for a given object, you can call the `dir()` function:
In [16]: dir(status_html)
Out[16]:
[(…)
'find_all',
'find_all_next',
'find_all_previous',
'find_next',
(…)
'text',
'unwrap',
'wrap']
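Since `dir()` returns a plain list of names, a list comprehension works nicely here too, for instance to keep only the find_* methods:

# Filter the attribute names down to the ones starting with "find"
[name for name in dir(status_html) if name.startswith("find")]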
And to get some explanation about one of these, you can use `help()` (or, in IPython, just type the name with a `?` at the end):
In [17]: help(status_html.find_all)
In [18]: status_html.find_all?
Signature:
status_html.find_all(
name=None,
attrs={},
recursive=True,
text=None,
limit=None,
**kwargs,
)
Docstring:
Look in the children of this PageElement and find all
PageElements that match the given criteria.
All find_* methods take a common set of arguments. See the online
documentation for detailed explanations.
:param name: A filter on tag name.
:param attrs: A dictionary of filters on attribute values.
:param recursive: If this is True, find_all() will perform a
recursive search of this PageElement's children. Otherwise,
only the direct children will be considered.
:param limit: Stop looking after finding this many results.
:kwargs: A dictionary of filters on attribute values.
:return: A ResultSet of PageElements.
:rtype: bs4.element.ResultSet
File: /usr/lib/python3/dist-packages/bs4/element.py
Type: method
If you want to know more about the Python REPL, check out Real Python's great article about it.
Happy Pythoning!
[1] I find the default REPL a bit sparse, so I tend to use IPython.

[2] On my Ubuntu Linux machine, it's actually installed by default, but even if it were not, it's an `apt install python3-bs4` away.

[3] IPython actually does that for you automatically, but it's always good to know about the `pprint` module!

[4] With IPython, you can use the `history` command to get a nice list of all the lines you've entered. Very handy to quickly copy-paste them!