Grab the data and run (the code)!
No data science website is complete without some outgoing links to the most famous repositories of DATA!
Whenever you feel ready to dive into the real world, the first thing you should do is grab the data and run (the code)!
Where? I am here for you:
Kaggle: there is a very big team behind this project, and it helped me a lot during my development as a code freebooter. It has lots of real-world datasets (with filters for specific fields) and a bunch of competitions you can enter, or mine for inspiration by copying others’ implementations. Yes, really, copy the code, we do that all the time.
UCI Machine Learning Repository: this is precious. Data here is less dynamic and more “databasy”, but the documentation and references to relevant papers make this site a must if you really want to delve into a problem.
Google Dataset Search: did you know about that? Well, honestly, neither did I. If mommy Google does something, it does it right (I am genuinely frightened by the fact that it recently landed in the bioinformatics field, and I’m sure it will steal all our jobs with an army of genius-driven cloud TPUs).
Awesome Data on GitHub: this is more loosely organized. For each topic you have a list of websites containing related data. Not necessarily formatted data, not necessarily clean data, not necessarily …anything. But if you need to complement your work with external information, this place really wins the podium for the wide range of topics it covers.