[Kalapa Credit Challenge] #3 prize (with Code)


#1

Hello everyone,

I will write in English because I don’t know Vietnamese. I hope you will understand my meaning.

Briefly about me

  • My name is Vlad and I’m from Russia
  • 5 years in Vietnam: CocCoc, & now FPT.AI with excellent colleagues (I learn a lot from them :))
  • Kaggle Competitions Grandmaster (Top 111 of 134,912)
  • I’m lazy, which is why I love optimization/automation/AI: it saves a lot of my time, even if it takes all of my time =))
  • Two years ago I joined a similar competition, Porto Seguro’s Safe Driver Prediction, and ended at a similar rank percentile:
	30/5163 ~ Top 0.5%
	 3/847  ~ Top 0.3%
  • I found this competition quite easy (no complicated data or deep architectures); for reasonably good predictions you only need fit() and then predict() :slight_smile: (at FPT.AI we use a much better and more complicated model for the FPT credit scoring solution)

Solution

  • You can read my code
  • My solution is similar to AutoML, which is actually good for competitions where the meaning of the data is hidden, or if you don’t want to spend a lot of personal time…
  • Firstly, I found that train and test are not similar, even via a simple test like this:
import pandas as pd

train['label'] = 1
test['label'] = 0
data = pd.concat([train, test])
CV(data)  # this 'label' should be unpredictable; in this competition it was not
  • After that, I removed F36 & F37
  • Most of the new features are combinations of simple math operations between pairs of common features, like this:
import numpy as np
from itertools import combinations

for l, r in combinations(auto_columns, 2):
    for func in 'add subtract divide multiply'.split():
        df[f'auto_{func}_{l}_{r}'] = getattr(np, func)(df[l], df[r])
  • I should have cleaned out most of the useless generated features after that (see the sketch after this list), but I left the competition too early
  • I felt free to submit my predictions to the LB just as a free check against deep mistakes, but actually I trust only my own validation!
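
A rough illustration of that cleaning step (a sketch only: the pruning rule, the model choice, and the names X/y are my assumptions, not code from the solution):

import lightgbm as lgb

# Hypothetical pruning: fit a quick booster and drop generated features
# that it never splits on.
model = lgb.LGBMClassifier(n_estimators=200, random_state=3462873)
model.fit(X, y)  # X: features incl. the auto_* columns, y: competition label

useless = [col for col, imp in zip(X.columns, model.feature_importances_)
           if col.startswith('auto_') and imp == 0]
X = X.drop(columns=useless)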

Conclusion

Public     0.25
Validation 0.26
Private    0.27

In this competition I got 3rd place with a bit of luck and, I guess, because most of the competitors overfitted to the LB. I would be glad if my prize (1M VND) were sent to the Covid Foundation.


[04/17/2020 23:49] A Russian friend whose private score ranks . . .
#2

A simple but significant solution. I had the same idea as you when processing the FIELD_7 feature; I haven’t seen anyone discussing it before. Just one question about the random-state idea: why did you choose random_state=3462873?


#4

Thank you. You are not the first who has asked me about that =)) It’s just a random number for the random state. Actually it’s important only for validation, or if you want to reproduce results. Sometimes I use a constant array of random states for safer validation.
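
A minimal sketch of what I mean (illustrative seeds and model, not my exact setup):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Average CV over a fixed array of random states: one seed can get lucky,
# several seeds give a more stable estimate of model quality.
SEEDS = [0, 42, 2020, 3462873]
scores = []
for seed in SEEDS:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = lgb.LGBMClassifier(random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean())
print(np.mean(scores), np.std(scores))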


#5

Congratulations on your 3rd place! Can you explain your point of view when transforming FIELD_3? How did you come up with it?


#6

Thank you. It was several months ago, but as far as I remember:

  • I noticed it when I looked at the histogram of FIELD_3
  • You can get the average season length if you run this code:
    import numpy as np

    values = df['FIELD_3'].sort_values().values
    diffs = np.diff(values)  # gaps between consecutive sorted values
    np.diff(values[np.where(diffs > 10)]).mean()  # spacing between the big gaps, ~365.3
  • I haven’t found any correlation between the day of the year and the label
  • If information is useless (just noise), I remove it - that’s good for training models (sketch below)
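
Roughly it could look like this (a hypothetical sketch: the 365-day split and the column names are my guesses, not the exact transform):

# If FIELD_3 is a day counter, keep only the coarse "year" part and drop
# the day-of-year part, which looked like pure noise with respect to the label.
df['FIELD_3_year'] = df['FIELD_3'] // 365
df = df.drop(columns=['FIELD_3'])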

#7

Thank you Vlad. Before this competition I didn’t really believe in automatic feature generation, but now I’ve changed my mind :slight_smile: Anyway, I don’t understand this code. What does it do? And what is “CV”?


#8

Thank you! It is abstract code where I tried to predict the location (train/test) of each row via CV (cross-validation). And in this competition it was predictable.
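
In concrete form it could look like this (a sketch; the model and metric here are just one reasonable choice):

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Adversarial validation: label each row by its origin and check whether
# a classifier can tell train rows from test rows.
data = pd.concat([train.assign(label=1), test.assign(label=0)],
                 ignore_index=True)
X = data.drop(columns=['label'])  # also drop the competition target, if present
y = data['label']

auc = cross_val_score(lgb.LGBMClassifier(), X, y,
                      cv=5, scoring='roc_auc').mean()
print(auc)  # ~0.5: train and test look alike; much higher: they do not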


#9

That’s a nice idea. I wonder why you did that step. What if the train & test sets were similar, and what if they were not?


#10

If they are similar, that’s OK; if not, you need to figure out why, or you will overfit on train.