One thing I love about Python is how quickly it lets you prototype or try things out, thanks to its interactive interpreter, also called the REPL.[1] In this article, I'll show you how I use it to run quick tests and verify assumptions.
The other day I wanted to write a little helper script to retrieve data from my Mastodon account. Specifically, I wanted to retrieve my recent posts and parse them to extract a list of all the URLs they contain.
Mastodon provides an API that returns JSON data. I had a look at the API documentation, and got to work.
So, let's say I would like to retrieve my latest posts on Mastodon. The API documentation mentions the `statuses` endpoint, which requires the user `id`. To find the user `id`, I can use the `lookup` endpoint.
Now, the other thing I love about Python is that it comes with "batteries included". Most of the time, for quick stuff, I don't need extra libraries.
To retrieve something from the Web, I can use the `urllib.request` module, and to work with JSON data, I can use the `json` module:
In [1]: import urllib.request
In [2]: import json
In [3]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/lookup?acct=pieq") as response:
...: account_data = json.load(response)
...:
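(For a quick interactive session like this I skip error handling, but in a real script I'd wrap that call. A minimal sketch, using only the standard `urllib.error` exceptions:

import json
import urllib.error
import urllib.request

url = "https://floss.social/api/v1/accounts/lookup?acct=pieq"
try:
    with urllib.request.urlopen(url) as response:
        account_data = json.load(response)
except urllib.error.HTTPError as err:
    # The server replied with an error status (e.g. 404 for an unknown account)
    print(f"HTTP error {err.code}: {err.reason}")
except urllib.error.URLError as err:
    # The request never reached the server (DNS failure, no network, ...)
    print(f"Network error: {err.reason}")

But let's not worry about that here.)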
Either way, et voilà! My account information has been stored in the `account_data` dictionary. I can use the pretty printer module[3] to display it nicely:
In [4]: from pprint import pprint
In [5]: pprint(account_data)
{'acct': 'pieq',
(…)
'bot': False,
'created_at': '2022-11-10T00:00:00.000Z',
(…)
'id': '109318008294543295',
(…)
'username': 'pieq'}
And so, I can retrieve my account `id` like this:
In [6]: account_data["id"]
Out[6]: '109318008294543295'
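Rather than copy-pasting that value around, I could also build the next URL directly from it with an f-string (`user_id` and `statuses_url` are just names I'm picking here):

user_id = account_data["id"]
statuses_url = f"https://floss.social/api/v1/accounts/{user_id}/statuses?exclude_replies=1"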
Now that I know my user `id`, I can retrieve the latest posts from Mastodon:
In [7]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/109318008294543295/statuses?exclude_replies=1") as response:
...: statuses = json.load(response)
...:
In [8]: len(statuses)
Out[8]: 20
I see that the latest 20 statuses have been retrieved, as explained in the API documentation (`len()` returns the length of an object).
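If I wanted a different page size, the documentation also describes a `limit` query parameter. While I'm at it, the standard library can build the query string for me; a small sketch, assuming `limit` behaves as documented (`more_statuses` is a name I'm introducing here):

from urllib.parse import urlencode

# Build the query string instead of concatenating it by hand
params = urlencode({"exclude_replies": 1, "limit": 40})
url = f"https://floss.social/api/v1/accounts/109318008294543295/statuses?{params}"
with urllib.request.urlopen(url) as response:
    more_statuses = json.load(response)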
I can check the content of the first status in the list like this:
In [9]: statuses[0]
Out[9]:
{'id': '110016084347354747',
'created_at': '2023-03-13T12:57:04.541Z',
(…)
'content': '<p>J\'écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C\'est passionnant ! Je n\'avais jamais fait le lien entre Pepper & Carrot et Sintel ! :blender:</p>',
(…)}
(Note that I could get the last status in the list by using a negative index: `statuses[-1]`. Pretty cool!)
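The output above is truncated; to see every field a status carries, I can simply list the dictionary's keys:

# Iterating over a dict yields its keys, so this lists them alphabetically
sorted(statuses[0])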
Now, what I wanted to do was to parse the content of each status and extract the links that are present.
I can extract the content from the latest status like this:
In [10]: statuses[0]["content"]
Out[10]: '<p>J\'écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C\'est passionnant ! Je n\'avais jamais fait le lien entre Pepper & Carrot et Sintel ! :blender:</p>'
As you can see, the content is a fragment of HTML. Parsing HTML can be quite a challenge, but fortunately, the BeautifulSoup library can help with that. I can install it[2] and get to work. I want to:
- parse the HTML fragment,
- find all the links,
- extract their URLs.
In [11]: from bs4 import BeautifulSoup
In [12]: status_html = BeautifulSoup(statuses[0]["content"], "html.parser")
In [13]: links = status_html.find_all("a")
In [14]: links
Out[14]:
[<a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a>,
<a class="u-url mention" href="https://framapiaf.org/@davidrevoy">@<span>davidrevoy</span></a>]
In [15]: [link.get("href") for link in links]
Out[15]:
['https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt',
'https://framapiaf.org/@davidrevoy']
That last line uses a feature of the Python language called a list comprehension, and it's really useful for massaging or filtering data from one list into another.
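A list comprehension can also filter while it transforms. For example, to keep only the external links and drop the @-mentions, I could filter on the "mention" CSS class that appears on those anchors in the markup above (a sketch based on that output; `external_links` is a name I'm picking here):

# Keep only links whose <a> tag does not carry the "mention" class;
# BeautifulSoup returns multi-valued attributes like class as a list
external_links = [
    link.get("href")
    for link in links
    if "mention" not in link.get("class", [])
]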
And here it is: in a few lines of code in the Python interactive interpreter, I was able to get what I wanted. I can now turn that into functions and save them in a script![4]
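Here is one way that script could look, a minimal sketch assembled from the snippets above (the function names are mine, and error handling is left out for brevity):

import json
import urllib.request

from bs4 import BeautifulSoup

API_BASE = "https://floss.social/api/v1"

def lookup_account_id(acct):
    """Return the account id for a given username."""
    with urllib.request.urlopen(f"{API_BASE}/accounts/lookup?acct={acct}") as response:
        return json.load(response)["id"]

def fetch_statuses(account_id):
    """Return the latest statuses of an account, excluding replies."""
    url = f"{API_BASE}/accounts/{account_id}/statuses?exclude_replies=1"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def extract_links(status):
    """Return all URLs found in the HTML content of a status."""
    soup = BeautifulSoup(status["content"], "html.parser")
    return [link.get("href") for link in soup.find_all("a")]

if __name__ == "__main__":
    for status in fetch_statuses(lookup_account_id("pieq")):
        for url in extract_links(status):
            print(url)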
One last thing. When you don't know what functions, methods or attributes are available for a given object, you can call the `dir()` function:
In [16]: dir(status_html)
Out[16]:
[(…)
'find_all',
'find_all_next',
'find_all_previous',
'find_next',
(…)
'text',
'unwrap',
'wrap']
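Since `dir()` returns a plain list of names, a list comprehension works nicely here too, for instance to keep only the find_* methods:

# Filter the attribute names down to the ones starting with "find"
[name for name in dir(status_html) if name.startswith("find")]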
And to get some explanation about one of these, you can use `help()` (or, in IPython, just type the name with a `?` at the end):
In [17]: help(status_html.find_all)
In [18]: status_html.find_all?
Signature:
status_html.find_all(
name=None,
attrs={},
recursive=True,
text=None,
limit=None,
**kwargs,
)
Docstring:
Look in the children of this PageElement and find all
PageElements that match the given criteria.
All find_* methods take a common set of arguments. See the online
documentation for detailed explanations.
:param name: A filter on tag name.
:param attrs: A dictionary of filters on attribute values.
:param recursive: If this is True, find_all() will perform a
recursive search of this PageElement's children. Otherwise,
only the direct children will be considered.
:param limit: Stop looking after finding this many results.
:kwargs: A dictionary of filters on attribute values.
:return: A ResultSet of PageElements.
:rtype: bs4.element.ResultSet
File: /usr/lib/python3/dist-packages/bs4/element.py
Type: method
If you want to know more about the Python REPL, check out Real Python's great article about it.
Happy Pythoning!
[1] I find the default REPL a bit sparse, so I tend to use IPython.

[2] On my Ubuntu Linux machine, it's actually installed by default, but even if it were not, it's an `apt install python3-bs4` away.

[3] IPython actually does that for you automatically, but it's always good to know about the `pprint` module!

[4] With IPython, you can use the `history` command to get a nice list of all the lines you've entered. Very handy to quickly copy-paste them!