Earn $100/GiB with Magic's data bounty
At Magic, we are building an AI software engineer, and we need data to train it. We have an internal data-gathering team but are also interested in purchasing data. We will pay $100/GiB for text data that meets our standards.
How we will pay you
At $100/GiB, a 10 TiB dataset would earn up to $1,000,000, if you can find that much. We are generally only interested in datasets of at least 10 GiB, though we will accept smaller datasets of exceptionally high quality and may pay more at our discretion. Please reach out with a sample of your data, or an idea for a data source, to ask whether it may qualify.
Evaluating dataset size
We only want good data that we do not already have, so we will filter submitted data, dedupe it against our internal collection, and pay only for what remains. We will only use the data that survives this filtering. For example, if we find that 90% of a 500 GiB dataset is already present in our collection, we will pay you for 50 GiB, and we will not use the filtered-out 450 GiB in any of our models. If we reject your data, we will not use it. We are setting up this form so you can ask in advance whether we will accept your data.
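To give a feel for what "dedupe against what we have" means in practice, here is a minimal sketch of exact paragraph-level deduplication. This is our illustration only, not Magic's actual pipeline, and a real system would likely add fuzzy/near-duplicate matching (e.g., MinHash) on top of exact hashing:

```python
import hashlib

def dedupe_paragraphs(new_docs, existing_hashes):
    """Keep only paragraphs whose normalized hash is not already known.

    new_docs: list of documents (strings, paragraphs separated by blank lines).
    existing_hashes: set of SHA-256 hex digests of paragraphs already in the
    collection; updated in place so later submissions dedupe against earlier ones.
    """
    kept = []
    for doc in new_docs:
        for para in doc.split("\n\n"):
            # Normalize whitespace and case so trivial variants hash identically.
            norm = " ".join(para.split()).lower()
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest not in existing_hashes:
                existing_hashes.add(digest)
                kept.append(para)
    return kept
```

The payable size would then be measured on the text that survives this pass, which is why a submission that heavily overlaps existing sources can shrink dramatically.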
What we are looking for
We already have data from many sources, including GitHub, Reddit, Common Crawl, Stack Overflow, patents, Usenet, Wikipedia, arXiv, etc., and we will reject submissions from these sources, except for cleaned, high-quality scrapes of websites included in Common Crawl (e.g., forums). If your data is a mess (raw HTML left in, cut-off sentences, garbled OCR output, gaps like those you get from PDF text parsing, etc.) or heavily duplicated against data we already have, we will reject it, so please clean your data! We are most interested in data related to programming (code, issues, discussions, etc.), but also in all other kinds of text data. The more "STEM-y" the data, the better (e.g., scientific papers, worksheets).
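As a concrete example of the "no raw HTML" bar, here is one way to extract visible text from a scraped page using only the Python standard library. This is a hedged sketch of the general idea, not a prescribed tool; production scrapes would usually need site-specific handling as well:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style contents."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    # Collapse whitespace runs left behind by removed tags.
    return " ".join("".join(parser.parts).split())
```

Whatever tooling you use, the output should read as clean, continuous human text: no leftover tags, no truncated sentences, no boilerplate navigation.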
We want the code that reproduces the dataset
Unless you are offering an existing proprietary dataset you own, we require the code you used to collect and clean the data as part of the deliverable, so we can reproduce the scrape and customize the code to our needs. We will not share this code with anybody else. When we refer to data here, we mean raw text. We only pay for non-text sources if we can convert them into text (e.g., via OCR or transcription), and we count dataset size on the resulting text.
As an example, here is a random Reddit post, and here is something like what we want, which we generated manually. We stripped the HTML and the dates/points, converted the comments into a tree format, stripped flairs, and removed comment trees below 1 point. These are arbitrary choices, and many others would be reasonable here, for example masking out usernames or only taking the first comment tree. Ultimately, these choices depend intimately on the precise dataset you are working with. The guiding principle is that all the text should be human-written, it should be concise, and the formatting should be reasonably raw, without boilerplate.
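The comment-tree steps above can be sketched in a few lines. The field names (`author`, `points`, `body`, `replies`) are our assumptions for illustration, not a required schema; the point is the shape of the transformation, namely recursive rendering with whole subtrees dropped below a score threshold:

```python
def format_comment_tree(comment, depth=0, min_points=1):
    """Render a nested comment dict as indented plain text.

    `comment` is assumed to look like:
        {"author": str, "points": int, "body": str, "replies": [...]}
    Subtrees rooted at a comment below `min_points` are dropped entirely,
    mirroring the "remove low-scoring comment trees" step described above.
    """
    if comment["points"] < min_points:
        return []
    lines = ["    " * depth + f"{comment['author']}: {comment['body']}"]
    for reply in comment.get("replies", []):
        lines.extend(format_comment_tree(reply, depth + 1, min_points))
    return lines
```

Whether you threshold on points, mask usernames, or keep only the first tree is exactly the kind of dataset-specific choice described above.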
If you do well on this, we would also be interested in contracting with you or hiring you for our data team.
All must be legal
We care about the copyright status and legality of data procurement, and we will ask questions to understand where your data comes from to ensure it meets our ethical and legal standards.
If you want to submit a dataset, please fill in the form below to ask if we already have it. We'll email you if we like your idea and don't yet have the dataset!