I'm analyzing some data at the county level. I'm going to pair it with other county-level data (socioeconomic, race, etc.) provided by the Census.
I thought it'd be neat to overlay public sentiment via scraping public, geocoded tweets about the topic and doing sentiment analysis, etc.
How do I develop a sound sampling method for the Twitter scraping to get a county-level estimate?
The US has 3,143 counties.
Simple random sampling would need a sample of 343 counties for a 95% confidence level (at a 5% margin of error).
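For reference, here is a minimal sketch of where the 343 figure comes from, assuming it was computed with Cochran's formula plus a finite population correction, a 5% margin of error, and the most conservative proportion p = 0.5. The function name and defaults are only illustrative.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumptions (not stated in the question): z = 1.96 for 95% confidence,
    a 5% margin of error, and p = 0.5 as the worst-case proportion.
    """
    # Cochran's formula for an infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(3143))  # 343
```

Note that this sizes the sample of counties only; it says nothing about how many tweets or cities are needed inside each county.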
That seems like a doable amount of scraping. I'm a bit confused about how to proceed with the next step.
How do I determine the number of cities I need to sample within a county? Should it be the same for all counties?
How do I determine which cities to sample? Do I look up the county's Wikipedia page and choose randomly again? Or do I pick the top X most populous cities, since geolocated Twitter data might be limited in smaller cities, while accepting that this skews the data towards urban centres?
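To make the two options concrete, here is a rough sketch of both selection strategies for a single county. The city names and populations are made up for illustration, and the function names are hypothetical.

```python
import random

# Hypothetical cities in one county, mapped to made-up populations.
county_cities = {
    "Alphaville": 120_000,
    "Betatown": 45_000,
    "Gamma Springs": 8_000,
    "Delta Junction": 1_500,
}

def pick_random(cities, k, seed=0):
    """Option 1: choose k cities uniformly at random (seeded so it is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(sorted(cities), min(k, len(cities)))

def pick_top(cities, k):
    """Option 2: choose the k most populous cities (biased towards urban centres)."""
    return sorted(cities, key=cities.get, reverse=True)[:k]

print(pick_random(county_cities, 2))
print(pick_top(county_cities, 2))  # ['Alphaville', 'Betatown']
```

The random pick treats every city equally but may land on places with few geolocated tweets, while the top-X pick is more likely to return data but over-represents urban areas.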
I understand that this isn't completely statistically sound, but I'd like to get the best possible result within the limitations I have.
