/sci/ - Science & Math » Thread #13791245

73KiB, 1332x354, weirdmodel.jpg


Anonymous Tue 26 Oct 2021 08:54:17 No.13791245

Quoted By: >>13791508 >>13791577

Hello everyone (brainlet boomer /sp/ tourist here). I am working on a little data mining project. It's just a hobby I do to improve my limited programming skills, so I am not any kind of professional. I know this is not stackoverflow, but maybe you can help me choose a proper model for my data.

So, let's say we have a table with three columns (pic related):
-Year: Numeric type column
-Color: String type column
-Name: String type column

The first two columns have no missing values, but only around 20% of the rows have a Name value in the third column. The Name value depends somewhat on the first two columns (not a causal relation).

My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each Name value (for example in a boxplot).

What do you think would be a good approach to this task? I have illustrated the process in my pic. Sorry for the low IQ paint drawing, but I tried to make it clear. You will probably find my question a bit dumb, but I appreciate any help.


Anonymous Tue 26 Oct 2021 12:02:34 No.13791508

Quoted By: >>13791653

>>13791245
bump because i wish i knew how to help


Anonymous Tue 26 Oct 2021 12:38:32 No.13791577

Quoted By: >>13791653

>>13791245

how you would solve this problem would depend a lot on the kind of environment you are working with

for example you could use something like a boxplot by group in R, which is a free, open source software environment for statistical analysis:

https://r-charts.com/distribution/box-plot-group/

you could do a similar thing in python with seaborn

https://seaborn.pydata.org/tutorial/categorical.html


Anonymous Tue 26 Oct 2021 13:21:15 No.13791653

Quoted By: >>13791770

>>13791577

Thanks, based Anon, although I am not sure I explained myself clearly.
I use both R and Python. The point has to do with the statistical extrapolation of my sample to the whole table.

That is to say: my table has 4000 rows. Around 1000 of them have a value in the NAME variable. The rest only have NaN. But I would like to know how many of these 3000 missing values are Davids, how many are Chads, etc.

I have imagined a process like this, although I am not very sure it makes sense statistically (any objections and suggestions are appreciated):

For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".

When the above process ends, the number of occurrences for each name is counted.

The above process is applied 30 times and the results for each name are displayed in a boxplot.

The point is that my statistics background is very limited, so I am not sure if there is a more professional way of doing this or a widely known algorithm useful in this kind of situation.
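
A rough sketch of the process above in plain Python, just to make the steps concrete. Several things here are my own assumptions and not part of the original idea: rows are (year, color, name) tuples with None standing in for NaN, Year is grouped by decade so that similar years share a cell, and a row whose (decade, color) cell has no known names falls back to the overall name frequencies. The function name `impute_counts` is made up for the example. The counts returned cover the whole table (known names plus imputed ones).

```python
import random
from collections import Counter, defaultdict

def impute_counts(rows, n_runs=30, seed=0):
    """rows: list of (year, color, name); name is None when missing.
    Returns a list with one Counter of name totals per run."""
    rng = random.Random(seed)

    # Name frequencies among known rows, per (decade, color) cell.
    # Binning by decade is an assumption made for this sketch.
    by_cell = defaultdict(Counter)
    overall = Counter()
    for year, color, name in rows:
        if name is not None:
            by_cell[(year // 10, color)][name] += 1
            overall[name] += 1

    results = []
    for _ in range(n_runs):
        totals = Counter(overall)  # start from the known names
        for year, color, name in rows:
            if name is not None:
                continue
            # Fall back to the overall frequencies for unseen cells.
            freq = by_cell.get((year // 10, color)) or overall
            names = list(freq)
            weights = [freq[n] for n in names]
            totals[rng.choices(names, weights=weights)[0]] += 1
        results.append(totals)
    return results

rows = [
    (1900, "Purple", "David"),
    (1905, "Purple", "David"),
    (1990, "Red", "Chad"),
    (1902, "Purple", None),
    (1995, "Red", None),
    (1950, "Blue", None),
]
runs = impute_counts(rows, n_runs=30, seed=1)
print(len(runs), runs[0])
```

The list of per-run counts for each name (for example `[r["David"] for r in runs]`) is what would go into the boxplot.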

>>13791508
Thank you too


Anonymous Tue 26 Oct 2021 14:19:08 No.13791770

Quoted By:

>>13791653
this is just a guess but,

if i had a table/matrix where the third column contained a series of "NaN" values that I wanted to replace with something else, and assuming I am not able to use any in-built functions for whatever I am using to plot it, I would write a for loop that searched through that column looking for each "NaN" value and replaced it with the value i wanted.

the particular way you would do this would vary by programming environment (e.g. in matlab a vectorising approach would likely be faster and more efficient than a for loop).

to figure out how you would allocate the new values, without using any in built functions, you would want to loop over the whole table and build up a list of all the combinations of YEAR, COLOR, NAME that occur, counting how often each one appears. from those counts you would be able to work out the proportion of rows each combination accounts for.

let's say {1900, Green, David} occurs 5 times and {1900, Yellow, Sarah} occurs 5 times and that's all that occurs.

then {1900, Green, David} is 5/10, and {1900, Yellow, Sarah} is 5/10

what you could then do is write a function that generates a random number between 0 and 1

if the number falls within the range [0, 0.5) then it will replace NaN with David,
if the number falls within the range [0.5, 1) then it will replace NaN with Sarah,

just as long as the range you assign to each combination of {YEAR, COLOR, NAME} coincides with the proportion of those {YEAR, COLOR, NAME} it should be fine.
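
a minimal sketch of the range idea in plain Python, using the David/Sarah counts from the example above. the function names `build_ranges` and `pick` are made up for illustration, and the combinations are stored as (year, color, name) tuples, which is just one way of doing it.

```python
import random
from collections import Counter

def build_ranges(combos):
    """Give each distinct combination a half-open slice [low, high) of
    [0, 1) whose width equals its share of the known rows."""
    counts = Counter(combos)
    total = sum(counts.values())
    ranges, low = [], 0.0
    for combo, count in counts.items():
        high = low + count / total
        ranges.append((low, high, combo))
        low = high
    return ranges

def pick(ranges, u):
    """Return the combination whose slice contains u (0 <= u < 1)."""
    for low, high, combo in ranges:
        if low <= u < high:
            return combo
    return ranges[-1][2]  # guard against rounding at the top end

known = [(1900, "Green", "David")] * 5 + [(1900, "Yellow", "Sarah")] * 5
ranges = build_ranges(known)
print(ranges)

# Draw a random number in [0, 1) and use the name from the matching slice.
year, color, name = pick(ranges, random.random())
print(name)
```

note that this draws from all combinations at once. to make the pick depend on the YEAR and COLOR of the row being filled, you would build the ranges only from the known rows that share that row's YEAR and COLOR.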

however, if it was me and I was given that data set I would just delete those NaN values as trying to replace them might be misleading and doesn't really seem to be adding much more information as it's not real data.

i also highly suspect that there would be some kind of in built function in whatever package you are using that could do a lot of this rather than programming it yourself.
