How do you work with the output from this search client? #116

jimmoffitt · 2021-03-02T04:10:27Z

jimmoffitt
Mar 2, 2021
Maintainer

We are discussing how to handle client output... With v2 JSON, the search payloads are redesigned around a core "data" array containing the Tweets that matched the query, plus a separate "includes" object that holds User objects, referenced Tweets (e.g. original Tweets for Quote Tweets and Retweets), and other supporting Twitter objects (e.g. media, polls, places).

Before v2, the search payloads provided "atomic" Tweets, with all supporting object attributes inside the Tweet object.

We are working on a new "atomic" option that has the v2 client output atomic Tweet objects by looking up the associated "includes" objects and injecting them into the "data" Tweets... I think that would be cool. Two of us (thanks Igor!) are experimenting with this and should have an update here soon (?).
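To make the idea concrete, here is a minimal sketch of what that injection step could look like. It is not the client's implementation: the function name `flatten_response` and the injected keys `author` and `tweet` are hypothetical choices for illustration, and it only handles users and referenced Tweets, not media, polls, or places. It relies on the v2 fields `author_id` and `referenced_tweets` being present in the payload.

```python
import copy


def flatten_response(response):
    """Build 'atomic' Tweets from one v2 search response (illustrative sketch)."""
    includes = response.get("includes", {})
    # Index the supporting objects by id so each lookup is a dict access.
    users = {user["id"]: user for user in includes.get("users", [])}
    tweets = {tweet["id"]: tweet for tweet in includes.get("tweets", [])}

    atomic = []
    for tweet in response.get("data", []):
        tweet = copy.deepcopy(tweet)  # leave the original payload untouched

        # Inject the author's User object when it was returned in "includes".
        # The "author" key is a hypothetical name, not part of the v2 payload.
        author = users.get(tweet.get("author_id"))
        if author is not None:
            tweet["author"] = author

        # Inject the full Tweet next to each reference (retweeted, quoted, ...).
        # The "tweet" key is likewise a hypothetical name.
        for ref in tweet.get("referenced_tweets", []):
            full = tweets.get(ref["id"])
            if full is not None:
                ref["tweet"] = copy.deepcopy(full)

        atomic.append(tweet)
    return atomic
```

Because the input is deep-copied, the response as received stays available alongside the atomic version, so both output styles can come from the same request.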

It would also be good to have the client just output the response as received from the search endpoint. Seems simple enough.

So, how do you work with the client output, and what new tricks would make your integrations easier?

I hope to plug this thing in as a "database loader", so maybe that will result in more built-in output options.

igorbrigadir · 2021-03-04T17:09:48Z

igorbrigadir
Mar 4, 2021

Here's how it's done in twarc: https://github.com/DocNow/twarc/blob/v2/twarc/expansions.py (it's almost identical to #112 because I added it there too)

From another commit message that explains the reasoning:

This commit extracts the flatten option from sample and moves it into a
separate command that twarc users could run on their data. This is to
encourage people to collect the original data wherever possible.

So where previously you would have done this:

    twarc2 sample --flatten > sample.jsonl

You will now do this:

    twarc2 sample > sample.jsonl
    twarc2 flatten sample.jsonl > sample-flattened.jsonl

Or, if you *really* don't want the original JSON, you can create a
pipeline:

    twarc2 sample | twarc2 flatten > sample.jsonl

So the atomic format in #112 is identical to what twarc calls flattened. In twarc, the idea is to store the original responses (the r variety in #112) and then optionally post-process them into the a variety.

After the tweets are retrieved, they usually get stored as-is in JSONL (one JSON object per line), gzipped. These are loaded into R or Python for analysis later, usually as CSVs, flattened into a dataframe with the pandas JSON reader or something like that.
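As a rough illustration of that workflow, the sketch below reads a gzipped JSONL file and writes a flat CSV using only the standard library. The function names and the dotted column naming are assumptions made for this example, not anything twarc or the search client provides.

```python
import csv
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def flatten_record(record, prefix=""):
    """Collapse nested dicts into one level with dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten_record(value, prefix=name + "."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)  # keep lists as JSON text in one cell
        else:
            row[name] = value
    return row


def jsonl_gz_to_csv(src, dest):
    """Write every record in src as a CSV row; return the column names used."""
    rows = [flatten_record(record) for record in read_jsonl_gz(src)]
    columns = sorted({column for row in rows for column in row})
    with open(dest, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)  # columns missing from a row are left empty
    return columns
```

With pandas installed, `pandas.read_json(path, lines=True)` reads the same kind of file into a dataframe, and `pandas.json_normalize` does a similar dotted-column flattening of nested objects.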
