/sci/ - Science & Math » Thread #14076287

Data Science question - Time series analysis

Anonymous Sun 09 Jan 2022 01:27:56 No.14076287 Report

Quoted By: >>14076296 >>14076470

Hey all,
I'm sincerely hoping someone might be able to help me figure out the steps to take to perform some Time Series-type Analyses on this dataset I have.

I've collected the 'audio features' from Spotify's API for the Top 200 songs for every day since the start of 2018, in like 66 different countries. (The 'audio features' include Energy, Danceability, Instrumentalness, Valence, etc.)
I've also added in each country's daily reported # of new cases of Covid & # of daily deaths from Covid.

This makes a >16million row dataset I have at the moment.

My goal is to analyze the fluctuations of each country's top songs, and determine whether there have been any significant changes in their usual values. I'd also like to show whether the amount of Covid cases/deaths has any relationship with the ebbs & flows in that country's Audio Features.

However, I'm having trouble figuring out how to go about determining any of those relationships – starting I guess with how to wrangle the shape of this dataset.
I guess it's not a straightforward Time Series, nor does it seem quite like it's "Panel Data"....with those, isn't it 1 or multiple subject(s) with its multiple observations coming from different dates over time? Whereas with this, it is looking at multiple countries' observations over time, but also each *day* itself has 200 observations within it.

It's 200 rows every day, for every country as well. I could take a first step by just finding the average of, say, the 'Energy' of all songs in a country's daily Top 200, but then I lose information about that day's complete spread of values... so I assume this must be where Vector Regression can somehow be used?

I'm not great at this yet. Thank y'all for even reading this far!

tl;dr – How do I find the relationships between a multivariable time series that has 200 daily observations for each country over 4 years, and also how those Series might relate to # of Covid cases/deaths?

Anonymous Sun 09 Jan 2022 01:29:12 No.14076296 Report

Quoted By:

>>14076287
{[(Forgot to mention that I'm primarily using Python, though I've also got R available. Visualizations will be made later in Tableau)]}

Anonymous Sun 09 Jan 2022 02:16:09 No.14076470 Report

Quoted By: >>14076903

>>14076287
You've noted that there are probably fluctuations in your data plus interesting covid effects. Run a KPSS test to check whether each series is stationary; that tells you how much differencing it needs before an ARIMA model will fit. An ARIMA-style decomposition treats the series as trend + cyclic + "stationary process", and there are lots of packages that can help you do this decomposition.

The decomposition is usually performed by differencing the time series and applying "Box-Cox" transformations to stabilize the variance. You will have to do this by hand, and the point is that you eventually want to wind up with the "stationary process", which should look like white noise. In a pure time series analysis this is usually the end point, but in your case it is the trend-and-cyclic-independent component of the data that would be good to correlate with your covid data to see if there's a relationship. The reason to take out the trend and cyclic terms first is that they might overwhelm any covid effect you might otherwise find in your music data.

The repeated differencing and Box-Cox transformations are not hard. Just find a post somewhere that describes what the end result should look like, so you know when to stop doing them.
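
A minimal sketch of the steps in this post, assuming one country's daily median of a single feature is already in a pandas Series. The synthetic `energy` series, the helper names, and the choice of a weekly (lag 7) seasonal difference are all illustrative assumptions, not something from the thread. The KPSS helper relies on `statsmodels.tsa.stattools.kpss`, whose null hypothesis is that the series is stationary, so a small p-value suggests more differencing is needed.

```python
import numpy as np
import pandas as pd


def box_cox(x, lam):
    """Box-Cox transform for strictly positive values; lam=0 is the log."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive values")
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def difference(series, lag=1):
    """Subtract the value `lag` steps earlier and drop the undefined start."""
    return series.diff(lag).dropna()


def kpss_pvalue(series):
    """KPSS p-value; the null hypothesis is that the series is stationary."""
    # Imported here so the helpers above work without statsmodels installed.
    from statsmodels.tsa.stattools import kpss
    _stat, p_value, _lags, _crit = kpss(series, regression="c", nlags="auto")
    return p_value


# Synthetic stand-in (assumption): trend + weekly cycle + noise.
rng = np.random.default_rng(0)
days = pd.date_range("2018-01-01", periods=730, freq="D")
t = np.arange(len(days))
energy = pd.Series(
    0.6 + 0.0002 * t + 0.02 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.01, len(t)),
    index=days,
)

stabilized = pd.Series(box_cox(energy, lam=0), index=days)  # variance stabilization
deseasoned = difference(stabilized, lag=7)                  # removes the weekly cycle
residual = difference(deseasoned, lag=1)                    # removes the leftover trend
```

Calling `kpss_pvalue(energy)` and then `kpss_pvalue(residual)` is one way to decide when to stop differencing: keep going only while the test still rejects stationarity.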

Anonymous Sun 09 Jan 2022 04:02:26 No.14076903 Report

Quoted By: >>14079431

>>14076470
Ah, thank you for these suggestions, fellow anon!
I've got a week left to do it, so I'll try my best.

Admittedly I'm still confused about the 200 observations per day. Can I ask if you'd recommend getting the median and/or mean for each country's daily chart so that each country/day combo only has 1 row for the Audio Features values (as opposed to 200 separate ones)?

And does that ARIMA+decomp process have to be done separately for each individual variable of interest (Energy, Valence, etc.) against the Covid data? Or are there models that can analyze the whole list of (potentially-)explanatory variables at once?

Finally: once I've got the data to look stable by removing the trend and cyclic components, I assume there is some simple way to compute that correlation? Does the process return a score/value for each 'day', and then that value is what I can correlate with the Covid factors?

Thanks again for the clues, though, yo!

Anonymous Sun 09 Jan 2022 19:49:18 No.14079431 Report

Quoted By:

>>14076903
>Admittedly I'm still confused about the 200 observations per day. Can I ask if you'd recommend getting the median and/or mean for each country's daily chart so that each country/day combo only has 1 row for the Audio Features values (as opposed to 200 separate ones)?
Ah yeah, it's too bad that 200 rows are going to be reduced to 1, but I can't think of a better idea at the moment. The median is probably better for your use case, but it probably doesn't matter: you're not looking at a single day's median or mean but at very many days, in which case the median and mean will usually be about the same.
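
A rough sketch of that reduction in pandas, assuming a long-format table with one row per country, date, and chart position. The column names and the random data are hypothetical. Keeping a couple of quantiles next to the median and mean is one way to retain some of the daily spread instead of discarding it entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical chart data: 2 countries x 3 days x 200 chart positions.
rng = np.random.default_rng(1)
dates = pd.date_range("2020-03-01", periods=3, freq="D")
index = pd.MultiIndex.from_product(
    [["US", "SE"], dates, range(1, 201)], names=["country", "date", "position"]
)
charts = pd.DataFrame(
    {"energy": rng.uniform(size=len(index)), "valence": rng.uniform(size=len(index))},
    index=index,
).reset_index()


def q25(s):
    """Lower quartile of one day's chart."""
    return s.quantile(0.25)


def q75(s):
    """Upper quartile of one day's chart."""
    return s.quantile(0.75)


# One row per (country, date); several summary statistics per feature.
daily = charts.groupby(["country", "date"])[["energy", "valence"]].agg(
    ["median", "mean", q25, q75]
)
# Flatten the (feature, statistic) column pairs into names like "energy_median".
daily.columns = [f"{feature}_{stat}" for feature, stat in daily.columns]
daily = daily.reset_index()
```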

>And does that ARIMA+decomp process have to be done separately for each individual variable of interest (Energy, Valence, etc.) against the Covid data? Or are there models that can analyze the whole list of (potentially-)explanatory variables at once?

I think in your case it will be more straightforward to do it with each variable separately, since you don't really know in advance what will be significant. Alternatively, you could first do a Principal Component Analysis to reduce your dataset to a few principal components (maybe even just one?), which will give you less work to do.
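
For the PCA route, a small sketch using only NumPy's SVD on standardized columns. The `pca` helper and the synthetic feature matrix are assumptions for illustration; in practice the rows would be days (or country-days) and the columns the aggregated audio features.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD on standardized columns.

    Returns (scores, loadings, explained variance ratios).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained[:n_components]


# Synthetic data (assumption): 500 days x 4 features, three of them
# driven by one shared latent factor and one that is pure noise.
rng = np.random.default_rng(2)
latent = rng.normal(size=500)
X = np.column_stack([
    latent + rng.normal(scale=0.3, size=500),
    latent + rng.normal(scale=0.3, size=500),
    -latent + rng.normal(scale=0.3, size=500),
    rng.normal(size=500),
])

scores, loadings, explained = pca(X, n_components=2)
```

Each column of `scores` is then a single daily series that can go through the same differencing steps as an individual feature, and `explained` indicates how much of the total variance each component carries.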

>Finally: once I've got the data to look stable by removing the trend and cyclic components, I assume there is some simple way to compute that correlation? Does the process return a score/value for each 'day', and then that value is what I can correlate with the Covid factors?

Yes. What's left after removing the trend and seasonality is one residual value per day, and that is what you correlate with the covid numbers, since your hypothesis is that covid influences music choice.
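
One way that correlation could look, sketched with pandas. The `lagged_correlations` helper, the 14-day lag window, and the synthetic series are assumptions; a real covid series would likely need its own trend removed first, since raw case counts are far from stationary.

```python
import numpy as np
import pandas as pd


def lagged_correlations(residual, covid, max_lag=14):
    """Pearson correlation of residual[t] with covid[t - lag], lag = 0..max_lag."""
    return pd.Series(
        {lag: residual.corr(covid.shift(lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )


# Synthetic example (assumption): the music residual reacts negatively
# to the covid signal from 5 days earlier.
rng = np.random.default_rng(3)
days = pd.date_range("2020-03-01", periods=300, freq="D")
covid = pd.Series(rng.normal(size=300), index=days)
residual = -0.5 * covid.shift(5) + rng.normal(scale=0.5, size=300)

correlations = lagged_correlations(residual, covid)
strongest_lag = correlations.abs().idxmax()
```

`Series.corr` aligns on the date index and ignores pairs with missing values, so the gaps introduced by shifting do not need special handling.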

>Thanks again for the clues, though, yo!

Yw anon

Anonymous Mon 10 Jan 2022 03:37:15 No.14081181 Report

Quoted By: >>14082071 >>14082097

Create aggregates on a country-week basis (there will be too much noise otherwise): features like energy quantiles, danceability quantiles, etc. You should end up with something like 20 music features. Then create time features, like week number and year. Join this with the COVID numbers and simply fit a tree-based algorithm like random forest or gradient boosting. This is the easiest solution and it will produce the most accurate result. If you want to draw conclusions about what drives the predictions, use shap (https://github.com/slundberg/shap). Consider taking the log of the COVID numbers to make all the explanations and visualizations more expressive. The whole thing can be whipped up in less than an hour. Share a sample of the dataset and I might write the whole thing for you.
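
A hedged sketch of that pipeline. The post does not say which variable is the prediction target, so this assumes the model predicts log weekly covid cases from the music and time features; the column names, the chosen quantiles, and both helper functions are hypothetical.

```python
import numpy as np
import pandas as pd


def weekly_features(charts, covid, features=("energy", "valence"), quantiles=(0.1, 0.5, 0.9)):
    """One row per (country, week): chart quantiles, time features, log covid."""
    charts = charts.assign(week=charts["date"].dt.to_period("W").dt.start_time)
    covid = covid.assign(week=covid["date"].dt.to_period("W").dt.start_time)
    # Quantiles over every chart row in the week, one column per (feature, quantile).
    wide = charts.groupby(["country", "week"])[list(features)].quantile(list(quantiles)).unstack()
    wide.columns = [f"{f}_q{int(round(q * 100))}" for f, q in wide.columns]
    weekly_cases = covid.groupby(["country", "week"])["new_cases"].sum()
    table = wide.join(weekly_cases).reset_index()
    iso = table["week"].dt.isocalendar()
    table["week_number"] = iso["week"].astype(int)
    table["year"] = iso["year"].astype(int)
    table["log_cases"] = np.log1p(table["new_cases"])  # log1p copes with zero-case weeks
    return table


def fit_forest(table, target="log_cases"):
    """Fit a random forest on everything except identifiers and the raw target."""
    # Imported here so the feature-building step runs without scikit-learn.
    from sklearn.ensemble import RandomForestRegressor
    X = table.drop(columns=["country", "week", "new_cases", target])
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, table[target])
    return model, X


# Synthetic stand-in data (assumption): 2 countries, 4 full weeks, 200 songs a day.
rng = np.random.default_rng(4)
days = pd.date_range("2020-03-02", periods=28, freq="D")
chart_index = pd.MultiIndex.from_product(
    [["US", "SE"], days, range(200)], names=["country", "date", "position"]
)
charts = pd.DataFrame(
    {"energy": rng.uniform(size=len(chart_index)), "valence": rng.uniform(size=len(chart_index))},
    index=chart_index,
).reset_index()
covid_index = pd.MultiIndex.from_product([["US", "SE"], days], names=["country", "date"])
covid = pd.DataFrame(
    {"new_cases": rng.integers(0, 5000, size=len(covid_index))}, index=covid_index
).reset_index()

table = weekly_features(charts, covid)
```

After `model, X = fit_forest(table)`, something like `shap.TreeExplainer(model).shap_values(X)` should give per-row feature contributions, assuming the shap package is installed.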

Anonymous Mon 10 Jan 2022 11:24:32 No.14082071 Report

Quoted By: >>14082097

>>14081181

This guy analyzes. Do what he says.

Only thing I would add is to do thorough "summary stats" on all the quantiles and each month and so on. People like to look at easy averages and standard deviations and it gives you the basis for all your stories.

Possibly try some one week / month ahead lag structures for additional funny relationships.

I would personally also run just a basic linear regression on the data explaining one feature with the others, because that gives you a baseline result to compare other more "complex" results to.
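
The baseline could be as simple as an ordinary least-squares fit. This sketch uses `numpy.linalg.lstsq`; the `ols` helper and the synthetic relationship between valence, energy, and danceability are made up for illustration.

```python
import numpy as np


def ols(X, y):
    """Least-squares fit with an intercept.

    Returns (coefficients with the intercept first, R squared).
    """
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    r_squared = 1.0 - residuals.var() / y.var()
    return beta, r_squared


# Synthetic weekly medians (assumption): valence depends on the other two.
rng = np.random.default_rng(5)
energy = rng.uniform(size=300)
danceability = rng.uniform(size=300)
valence = 0.2 + 0.5 * energy + 0.3 * danceability + rng.normal(scale=0.02, size=300)

beta, r_squared = ols(np.column_stack([energy, danceability]), valence)
```

The resulting R squared is the number a more complex model then has to beat on the same data.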

Anonymous Mon 10 Jan 2022 11:25:02 No.14082072 Report

Quoted By: >>14082097

Sounds like an interesting route, what exactly would the RandomForest model be determining here, though?

Thanks for sharing your thoughts. Here I've uploaded a sample of the dataset with just a few countries' top charts each day (as well as Spotify's region for worldwide, here labelled 'Global'), if you'd be willing to try it out.

https://mega.nz/file/xwpGQagK#h_YV_Gt1Rnu55em8iYQofatLSyeViJ-zQsHOwuuUrEc

Anonymous Mon 10 Jan 2022 11:30:42 No.14082083 Report

Quoted By: >>14082097

What would happen if you took a huge dataset of MIDI files of pop music and analyzed the melodies and chords? I figure someone's already done this and combined it with AI to write music for them.

Anonymous Mon 10 Jan 2022 11:34:58 No.14082097 Report

Quoted By: >>14082309

>>14082072

Forgot to tag >>14081181 for the reply.

>>14082071
Thanks for the heads-up about summary stats; that's one of the only bits I've got done so far. I've got daily, monthly, and yearly summaries per Region, just to have options depending on how granular I want to go. Although I guess I should grab Weekly ones if that's recommended.
Might you suggest starting with a Backwards Regression for the Features, which would hopefully whittle down to which are the most explanatory for each other?

>>14082083
That would be awesome to check out! Great idea.

Anonymous Mon 10 Jan 2022 12:42:33 No.14082309 Report

Quoted By:

>>14082097

I personally probably wouldn't use a "method" to mechanically cut down the list of variables. Instead I would think about a story and then run that regression.

(This might be dumb, but) say I want to show that the increase in covid cases or deaths in one week / month is related to how much sad music is being listened to in the next week / month. Then that's the regression I would run, and then I'd add country dummies.

What will probably happen is that using all of the data at once will be really noisy and there will be no clear relationships. So then you might have to look at a subsample, or aggregate into bigger regions than countries, or use language area dummies rather than country dummies or something else.

Hopefully you can see a story from the summary stats so then the regression will just be used to confirm it in a multivariate setting.
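
The regression described in this post (next period's music feature on this period's covid numbers, plus country dummies) might be sketched like this. The panel layout, the column names, and the use of low valence as a stand-in for "sad music" are assumptions.

```python
import numpy as np
import pandas as pd


def lagged_dummy_regression(panel, outcome="valence", driver="log_cases", lag=1):
    """Regress outcome[t] on driver[t - lag] with one intercept per country."""
    panel = panel.sort_values(["country", "week"]).copy()
    # Shift within each country so one country's lag never leaks into another.
    panel["driver_lagged"] = panel.groupby("country")[driver].shift(lag)
    panel = panel.dropna(subset=["driver_lagged", outcome])
    dummies = pd.get_dummies(panel["country"], dtype=float)
    design = np.column_stack([panel["driver_lagged"].to_numpy(), dummies.to_numpy()])
    beta, *_ = np.linalg.lstsq(design, panel[outcome].to_numpy(), rcond=None)
    return pd.Series(beta, index=["driver_lagged", *dummies.columns])


# Synthetic panel (assumption): valence drops after weeks with more cases.
rng = np.random.default_rng(6)
frames = []
for country, base in {"US": 0.50, "SE": 0.45, "BR": 0.60}.items():
    log_cases = rng.uniform(0, 10, size=60)
    valence = base + rng.normal(0, 0.01, size=60)
    valence[1:] -= 0.05 * log_cases[:-1]
    frames.append(pd.DataFrame({
        "country": country,
        "week": np.arange(60),
        "log_cases": log_cases,
        "valence": valence,
    }))
panel = pd.concat(frames, ignore_index=True)

coefficients = lagged_dummy_regression(panel)
```

Using a full set of country dummies without a global intercept gives each country its own baseline level, so the `driver_lagged` coefficient reflects only within-country movement. Swapping the `country` column for a broader region or language-area label would give the coarser dummies mentioned above.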
