Quick tests using the Python interpreter

One thing I love about Python is how quickly it lets you prototype something or try an idea out, thanks to its interactive interpreter, also often called the REPL. [1] In this article, I'll show you how I use it to run quick tests and verify assumptions.

The other day I wanted to write a little helper script to retrieve data from my Mastodon account. Specifically, I wanted to retrieve my recent posts and parse them to extract a list of all the URLs they contain.

Mastodon provides an API that returns JSON data. I had a look at the API documentation, and got to work.

So, let's say I would like to retrieve my latest posts on Mastodon. The API documentation mentions the statuses endpoint, which requires the user id. To find the user id, I can use the lookup endpoint.

Now, the other thing I love about Python is that it comes with "batteries included". Most of the time, for quick stuff, I don't need extra libraries.

To retrieve something from the Web, I can use the urllib.request module, and to work with JSON data, I can use the json module:

In [1]: import urllib.request

In [2]: import json

In [3]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/lookup?acct=pieq") as response:
   ...:     account_data = json.load(response)
   ...: 

Et voilà! My account information is now stored in the account_data dictionary. I can use the pretty-printing module, pprint, to display it nicely [3]:

In [4]: from pprint import pprint

In [5]: pprint(account_data)
{'acct': 'pieq',
 (…)
 'bot': False,
 'created_at': '2022-11-10T00:00:00.000Z',
 (…)
 'id': '109318008294543295',
 (…)
 'username': 'pieq'}

And so, I can retrieve my account id like this:

In [6]: account_data["id"]
Out[6]: '109318008294543295'

Now that I know my user id, I can retrieve the latest posts from Mastodon:

In [7]: with urllib.request.urlopen("https://floss.social/api/v1/accounts/109318008294543295/statuses?exclude_replies=1") as response:
   ...:     statuses = json.load(response)
   ...: 

In [8]: len(statuses)
Out[8]: 20

I see that the latest 20 statuses have been retrieved, which is the default number described in the API documentation (len() returns the number of items in a container).

I can check the content of the first status in the list like this:

In [9]: statuses[0]
Out[9]: 
{'id': '110016084347354747',
 'created_at': '2023-03-13T12:57:04.541Z',
 (…)
 'content': '<p>J&#39;écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C&#39;est passionnant ! Je n&#39;avais jamais fait le lien entre Pepper &amp; Carrot et Sintel ! :blender:</p>',
 (…)}

(Note that I could get the last status in the list by using a negative index: statuses[-1]. Pretty cool!)
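
If negative indices are new to you, here is a quick, self-contained illustration with a throwaway list (shown at the default >>> prompt, since this isn't part of the session above):

>>> letters = ["a", "b", "c", "d"]
>>> letters[-1]   # -1 is the last element
'd'
>>> letters[-2]   # -2 is the second to last
'c'
>>> letters[:2]   # slicing works too: the first two elements
['a', 'b']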

Now, what I wanted to do was to parse the content of each status and extract the links that are present.

I can extract the content from the latest status like this:

In [10]: statuses[0]["content"]
Out[10]: '<p>J&#39;écoute une des dernières émissions de Libre À Vous consacrée à la bande dessinée et la culture libre :</p><p><a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" target="_blank" rel="nofollow noopener noreferrer"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a></p><p>Je decouvre, entre autres choses, le parcours de <span class="h-card"><a href="https://framapiaf.org/@davidrevoy" class="u-url mention">@<span>davidrevoy</span></a></span>. C&#39;est passionnant ! Je n&#39;avais jamais fait le lien entre Pepper &amp; Carrot et Sintel ! :blender:</p>'

As you can see, the content is a fragment of HTML. Parsing HTML can be quite a challenge, but fortunately, the BeautifulSoup library can help with that. I can install it [2] and get to work: parse the content as HTML, find all the links (the <a> tags), and extract the URL from each link's href attribute:

In [11]: from bs4 import BeautifulSoup

In [12]: status_html = BeautifulSoup(statuses[0]["content"], "html.parser")

In [13]: links = status_html.find_all("a")

In [14]: links
Out[14]: 
[<a href="https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt" rel="nofollow noopener noreferrer" target="_blank"><span class="invisible">https://www.</span><span class="ellipsis">libreavous.org/169-bd-et-cultu</span><span class="invisible">re-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt</span></a>,
 <a class="u-url mention" href="https://framapiaf.org/@davidrevoy">@<span>davidrevoy</span></a>]

In [15]: [link.get("href") for link in links]
Out[15]: 
['https://www.libreavous.org/169-bd-et-culture-libre-chatgpt-dans-l-eau-pituite-de-luk-sur-chatgpt',
 'https://framapiaf.org/@davidrevoy']

The last bit is a feature of the Python language called a list comprehension, and it's really useful for massaging or filtering data from one list into another.
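
For instance, here is a small, self-contained example (unrelated to the Mastodon data) showing how a single list comprehension can filter a list and keep only the items you care about:

>>> urls = ["https://example.org/a", "https://framapiaf.org/@davidrevoy", "https://example.org/b"]
>>> [url for url in urls if "example.org" in url]   # keep only example.org links
['https://example.org/a', 'https://example.org/b']

The expression before the for transforms each item (here it's just the item itself), and the optional if clause decides which items make it into the new list.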

And here it is: in a few lines of code in the Python interactive interpreter, I was able to get what I wanted. I can now put that into functions and save them in a script! [4]
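
As a rough idea, here is a minimal sketch of what such a script could look like. The function names and the hard-coded instance and account are my own choices for illustration, not anything prescribed by the Mastodon API:

import json
import urllib.request

from bs4 import BeautifulSoup

API_BASE = "https://floss.social/api/v1"


def lookup_account_id(acct):
    """Return the account id for a given account name."""
    url = f"{API_BASE}/accounts/lookup?acct={acct}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["id"]


def fetch_statuses(account_id):
    """Return the latest statuses of an account, excluding replies."""
    url = f"{API_BASE}/accounts/{account_id}/statuses?exclude_replies=1"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def extract_links(status):
    """Return all URLs found in a status' HTML content."""
    html = BeautifulSoup(status["content"], "html.parser")
    return [link.get("href") for link in html.find_all("a")]


if __name__ == "__main__":
    account_id = lookup_account_id("pieq")
    for status in fetch_statuses(account_id):
        for url in extract_links(status):
            print(url)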

One last thing. When you don't know what functions, methods or attributes are available for a given object, you can call the dir() function:

In [16]: dir(status_html)
Out[16]: 
[(…)
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
(…)
 'text',
 'unwrap',
 'wrap']
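
The output of dir() can be long. A list comprehension (again!) is handy to narrow it down; for example, to list only the names containing "find", you could type something like:

>>> [name for name in dir(status_html) if "find" in name]
['find',
 'find_all',
 'find_all_next',
 (…)]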

And to get some explanation about one of these, you can use help() (or, in IPython, just append a ? to the name):

In [17]: help(status_html.find_all)

In [18]: status_html.find_all?
Signature:
status_html.find_all(
    name=None,
    attrs={},
    recursive=True,
    text=None,
    limit=None,
    **kwargs,
)
Docstring:
Look in the children of this PageElement and find all
PageElements that match the given criteria.

All find_* methods take a common set of arguments. See the online
documentation for detailed explanations.

:param name: A filter on tag name.
:param attrs: A dictionary of filters on attribute values.
:param recursive: If this is True, find_all() will perform a
    recursive search of this PageElement's children. Otherwise,
    only the direct children will be considered.
:param limit: Stop looking after finding this many results.
:kwargs: A dictionary of filters on attribute values.
:return: A ResultSet of PageElements.
:rtype: bs4.element.ResultSet
File:      /usr/lib/python3/dist-packages/bs4/element.py
Type:      method

If you want to know more about the Python REPL, check out Real Python's great article about it.

Happy Pythoning!


  1. I find the default REPL a bit sparse, so I tend to use IPython instead, which is why the prompts in this article look like In [1]: rather than >>>.

  2. On my Ubuntu Linux machine, it's actually installed by default, but even if it weren't, it's an apt install python3-bs4 away.

  3. IPython actually does that for you automatically, but it's always good to know about the pprint module! 

  4. With IPython, you can use the %history magic command to get a nice list of all the lines you've entered. Very handy to quickly copy and paste them!