People who are interested in document-level near-deduplication at a large scale, and who have some understanding of hashing, graphs, and text processing.
Motivations
It is crucial to curate our data before feeding it to the model, at least a Large Language Model in our case, because, as the old saying goes, garbage in, garbage out. This is increasingly difficult to argue for when headline-grabbing models (or should we say APIs) create an illusion that data quality matters less.

One of the problems we face in both BigScience and BigCode regarding data quality is duplication, including possible benchmark contamination. It has been shown that models tend to output training data verbatim when there are many duplicates[1] (though it is less clear in some other domains[2]), and duplication also makes the model vulnerable to privacy attacks[1]. Some typical benefits of deduplication include:
- Efficient training: you can achieve the same, and sometimes better, performance with fewer training steps[3] [4].
- Preventing possible data leakage and benchmark contamination: non-zero duplicates discredit your evaluations and can make a so-called improvement a false claim.
- Accessibility: most of us cannot afford to download or transfer thousands of gigabytes of text repeatedly, let alone train a model with it. For a fixed-size dataset, deduplication makes it easier to study, transfer, and collaborate on.
From BigScience to BigCode
Allow me to first share a story on how I jumped on this near-deduplication quest, how the results have progressed, and what lessons I have learned along the way.

It all started with a conversation on LinkedIn when BigScience had already been running for a few months. Huu Nguyen approached me after noticing my pet project on GitHub, asking if I was interested in working on deduplication for BigScience. Of course, my answer was yes, while I was completely oblivious to just how much effort would be required due to the sheer amount of data.

It was fun and challenging at the same time. It was challenging in the sense that I didn't really have much research experience with data at that scale, yet everyone was welcoming and trusted me with thousands of dollars of cloud compute budget. Yes, I had to wake up several times just to double-check that I had turned those machines off. As a result, I had to learn on the job through all the trial and error, which in the end opened me up to a new perspective that I don't think I would ever have gained if it weren't for BigScience.

Moving forward, one year later, I am putting what I learned back into BigCode, working on even larger datasets. In addition to LLMs trained on English[3], we have confirmed that deduplication improves code models too[4], while using a much smaller dataset. And now, I am sharing what I have learned with you, my dear reader, so that hopefully you can also get a sense of what is happening behind the scenes of BigCode through the lens of deduplication.

In case you are interested, here is an updated version of the deduplication comparison that we started in BigScience:
| Dataset | Input Size | Output Size or Deduction | Level | Method | Parameters | Language | Time |
|---|---|---|---|---|---|---|---|
| OpenWebText2[5] | After URL dedup: 193.89 GB (69M) | After MinHashLSH: 65.86 GB (17M) | URL + Document | URL (Exact) + Document (MinHash LSH) | | English | |
| Pile-CC[5] | ~306 GB | 227.12 GiB (~55M) | Document | Document (MinHash LSH) | | English | "several days" |
| BNE5[6] | 2 TB | 570 GB | Document | Onion | 5-gram | Spanish | |
| MassiveText[7] | 0.001 TB ~ 2.1 TB | | Document | Document (Exact + MinHash LSH) | | English | |
| CC100-XL[8] | 0.01 GiB ~ 3324.45 GiB | | URL + Paragraph | URL (Exact) + Paragraph (Exact) | SHA-1 | Multilingual | |
| C4[3] | 806.92 GB (364M) | 3.04% ~ 7.18% ↓ (train) | Substring or Document | Substring (Suffix Array) or Document (MinHash) | Suffix Array: 50-token, MinHash: | English | |
| Real News[3] | ~120 GiB | 13.63% ~ 19.4% ↓ (train) | Same as C4 | Same as C4 | Same as C4 | English | |
| LM1B[3] | ~4.40 GiB (30M) | 0.76% ~ 4.86% ↓ (train) | Same as C4 | Same as C4 | Same as C4 | English | |
| WIKI40B[3] | ~2.9M | 0.39% ~ 2.76% ↓ (train) | Same as C4 | Same as C4 | Same as C4 | English | |
| The BigScience ROOTS Corpus[9] | | 0.07% ~ 2.7% ↓ (document) + 10.61% ~ 32.30% ↓ (substring) | Document + Substring | Document (SimHash) + Substring (Suffix Array) | SimHash: 6-grams, Hamming distance of 4, Suffix Array: 50-token | Multilingual | 12 hours ~ few days |
Here is a similar table for the code datasets we created for BigCode. Model names are used when the dataset name is not available.
| Model | Method | Parameters | Level |
|---|---|---|---|
| InCoder[10] | Exact | Alphanumeric tokens/md5 + Bloom filter | Document |
| CodeGen[11] | Exact | SHA256 | Document |
| AlphaCode[12] | Exact | ignore whitespaces | Document |
| PolyCoder[13] | Exact | SHA256 | Document |
| PaLM Coder[14] | Levenshtein distance | Document | |
| CodeParrot[15] | MinHash + LSH | Document | |
| The Stack[16] | MinHash + LSH | Document |
MinHash + LSH parameters:
- the number of permutations/hashes
- the Jaccard similarity threshold
- the n-gram/shingle size
- the number of bands
- the number of rows
To get a sense of how these parameters might impact your results, here is a simple demo to illustrate the computation mathematically: MinHash Math Demo.
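In case you cannot open the linked demo, here is a minimal sketch of the standard MinHash-LSH math it illustrates: with `b` bands of `r` rows each, two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`. The band counts and similarity values below are illustrative, not the BigCode settings.

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s**r) ** b

b, r = 25, 10  # e.g., 250 permutations split into 25 bands of 10 rows (illustrative)
# The steepest point of the S-curve, roughly (1/b) ** (1/r), approximates the similarity threshold.
print(f"approximate threshold: {(1 / b) ** (1 / r):.3f}")
for s in (0.5, 0.7, 0.85, 0.95):
    print(f"s={s:.2f} -> candidate probability {candidate_probability(s, b, r):.4f}")
```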
MinHash Walkthrough
In this section, we will cover each step of MinHash, the method used in BigCode, as well as potential scaling issues and their solutions. We will demonstrate the workflow with an example of three documents in English:
| doc_id | content |
|---|---|
| 0 | Deduplication is so much fun! |
| 1 | Deduplication is so much fun and easy! |
| 2 | I wish spider dog[17] is a thing. |
The typical workflow of MinHash is as follows:
- Shingling (tokenization) and fingerprinting (MinHashing), where we map each document into a set of hashes.
- Locality-sensitive hashing (LSH). This step reduces the number of comparisons by grouping documents with similar bands together.
- Duplicate removal. This step is where we decide which duplicated documents to keep or remove.
Shingles
As in most applications involving text, we need to begin with tokenization. N-grams, a.k.a. shingles, are often used. In our example, we will be using word-level tri-grams, without any punctuation. We will circle back to how the size of the n-grams impacts performance in a later section.
| doc_id | shingles |
|---|---|
| 0 | {"Deduplication is so", "is so much", "so much fun"} |
| 1 | {'so much fun', 'fun and easy', 'Deduplication is so', 'is so much'} |
| 2 | {‘dog is a’, ‘is a thing’, ‘wish spider dog’, ‘spider dog is’, ‘I wish spider’} |
This operation has a time complexity of $\mathcal{O}(NM)$, where $N$ is the number of documents and $M$ is the length of each document. In other words, it is linearly dependent on the size of the dataset. This step can easily be scaled by parallelization, whether with multiprocessing or distributed computation.
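For concreteness, here is a minimal shingling sketch under the same assumptions (word-level tri-grams, splitting on non-alphanumeric characters). It reproduces the doc 0 row above, though the exact tokenization in the real pipeline may differ slightly.

```python
import re
from typing import Set

NON_ALPHA = re.compile(r"\W+")  # assumption: tokens are runs of alphanumeric characters

def shingles(content: str, ngram_size: int = 3) -> Set[str]:
    """Return the set of word-level n-grams (shingles) of a document."""
    words = [w for w in NON_ALPHA.split(content) if w]
    return {" ".join(words[i : i + ngram_size]) for i in range(len(words) - ngram_size + 1)}

print(shingles("Deduplication is so much fun!"))
# {'Deduplication is so', 'is so much', 'so much fun'}
```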
Fingerprint Computation
In MinHash, each shingle will typically either be 1) hashed multiple times with different hash functions, or 2) permuted multiple times using one hash function. Here, we choose to permute each hash 5 times. More variants of MinHash can be found in MinHash – Wikipedia.
| shingle | permuted hashes |
|---|---|
| Deduplication is so | [403996643, 2764117407, 3550129378, 3548765886, 2353686061] |
| is so much | [3594692244, 3595617149, 1564558780, 2888962350, 432993166] |
| so much fun | [1556191985, 840529008, 1008110251, 3095214118, 3194813501] |
Taking the minimum value of each column within each document (the "Min" part of "MinHash"), we arrive at the final MinHash for that document:
| doc_id | minhash |
|---|---|
| 0 | [403996643, 840529008, 1008110251, 2888962350, 432993166] |
| 1 | [403996643, 840529008, 1008110251, 1998729813, 432993166] |
| 2 | [166417565, 213933364, 1129612544, 1419614622, 1370935710] |
Technically, we do not have to use the minimum value of each column, but the minimum is the most common choice. Other order statistics such as the maximum, the k-th smallest, or the k-th largest can be used as well[21].
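As a toy check of the column-wise minimum, here is the doc 0 computation using the hash values from the table above:

```python
import numpy as np

# Permuted hash values for doc 0's three shingles, copied from the table above (one row per shingle).
permuted = np.array(
    [
        [403996643, 2764117407, 3550129378, 3548765886, 2353686061],  # "Deduplication is so"
        [3594692244, 3595617149, 1564558780, 2888962350, 432993166],  # "is so much"
        [1556191985, 840529008, 1008110251, 3095214118, 3194813501],  # "so much fun"
    ],
    dtype=np.uint64,
)

# The "Min" in MinHash: take the minimum of each column across all shingles.
print(permuted.min(axis=0))
# [ 403996643  840529008 1008110251 2888962350  432993166]  -> doc 0's MinHash above
```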
In implementation, you can easily vectorize these steps with numpy and expect a time complexity of $\mathcal{O}(NK)$, where $N$ is the number of documents and $K$ is the number of permutations. The following code is modified based on Datasketch.
```python
from typing import Any, Dict, List, Set, Tuple

import numpy as np

# MAX_HASH, MERSENNE_PRIME, NON_ALPHA (a compiled regex), sha1_hash, and ngrams are
# module-level constants and helpers defined elsewhere in the script.
def embed_func(
    content: str,
    idx: int,
    *,
    num_perm: int,
    ngram_size: int,
    hashranges: List[Tuple[int, int]],
    permutations: np.ndarray,
) -> Dict[str, Any]:
    a, b = permutations
    masks: np.ndarray = np.full(shape=num_perm, dtype=np.uint64, fill_value=MAX_HASH)
    # Shingling: word-level n-grams joined back into strings.
    tokens: Set[str] = {" ".join(t) for t in ngrams(NON_ALPHA.split(content), ngram_size)}
    # One base hash per shingle, then num_perm affine permutations (a * h + b) % MERSENNE_PRIME, masked with MAX_HASH.
    hashvalues: np.ndarray = np.array([sha1_hash(token.encode("utf-8")) for token in tokens], dtype=np.uint64)
    permuted_hashvalues = np.bitwise_and(
        ((hashvalues * np.tile(a, (len(hashvalues), 1)).T).T + b) % MERSENNE_PRIME, MAX_HASH
    )
    # The "Min" step: column-wise minimum over all shingles.
    hashvalues = np.vstack([permuted_hashvalues, masks]).min(axis=0)
    # Split the signature into bands and serialize each band into bytes for grouping.
    Hs = [bytes(hashvalues[start:end].byteswap().data) for start, end in hashranges]
    return {"__signatures__": Hs, "__id__": idx}
```
If you are familiar with Datasketch, you might ask why we bother to strip away all the nice high-level functions the library provides. It is not because we want to avoid adding dependencies, but because we intend to squeeze out as much CPU computation as possible during parallelization. Fusing a few steps into one function call allows us to utilize our compute resources better.

Since one document's calculation does not depend on any other's, an easy parallelization choice is to use the map function from the datasets library:
```python
embedded = ds.map(
    function=embed_func,
    fn_kwargs={
        "num_perm": args.num_perm,
        "hashranges": HASH_RANGES,
        "ngram_size": args.ngram,
        "permutations": PERMUTATIONS,
    },
    input_columns=[args.column],
    remove_columns=ds.column_names,
    num_proc=os.cpu_count(),
    with_indices=True,
    desc="Fingerprinting...",
)
```
After the fingerprint calculation, each document is mapped to one array of integer values. To figure out which documents are similar to one another, we need to group them based on those fingerprints. Enter the stage: Locality Sensitive Hashing (LSH).
Locality Sensitive Hashing
LSH breaks the fingerprint array into $b$ bands, each band containing the same number of rows $r$. If there are any hash values left over, they will be ignored. Let's use $b=2$ bands and $r=2$ rows to group those documents:
| doc_id | minhash | bands |
|---|---|---|
| 0 | [403996643, 840529008, 1008110251, 2888962350, 432993166] | [0:[403996643, 840529008], 1:[1008110251, 2888962350]] |
| 1 | [403996643, 840529008, 1008110251, 1998729813, 432993166] | [0:[403996643, 840529008], 1:[1008110251, 1998729813]] |
| 2 | [166417565, 213933364, 1129612544, 1419614622, 1370935710] | [0:[166417565, 213933364], 1:[1129612544, 1419614622]] |
If two documents share the same hashes in a band at a particular location (band index), they will be clustered into the same bucket and considered candidates.
| band index | band value | doc_ids |
|---|---|---|
| 0 | [403996643, 840529008] | 0, 1 |
| 1 | [1008110251, 2888962350] | 0 |
| 1 | [1008110251, 1998729813] | 1 |
| 0 | [166417565, 213933364] | 2 |
| 1 | [1129612544, 1419614622] | 2 |
For each row in the doc_ids column, we can generate candidate pairs by pairing any two of its documents. From the table above, we can generate one candidate pair: (0, 1).
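Here is a toy sketch of the banding and bucketing described above, applied to the example signatures (not the production code, which groups on serialized band bytes instead):

```python
from collections import defaultdict
from itertools import combinations

# MinHash signatures from the earlier table.
signatures = {
    0: [403996643, 840529008, 1008110251, 2888962350, 432993166],
    1: [403996643, 840529008, 1008110251, 1998729813, 432993166],
    2: [166417565, 213933364, 1129612544, 1419614622, 1370935710],
}
b, r = 2, 2  # 2 bands of 2 rows each; the leftover 5th hash value is ignored

buckets = defaultdict(set)
for doc_id, sig in signatures.items():
    for band_idx in range(b):
        band = tuple(sig[band_idx * r : (band_idx + 1) * r])
        buckets[(band_idx, band)].add(doc_id)  # bucket key: (band index, band values)

candidates = set()
for doc_ids in buckets.values():
    candidates.update(combinations(sorted(doc_ids), 2))
print(candidates)  # {(0, 1)}
```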
Beyond Duplicate Pairs
This is where many deduplication descriptions in papers or tutorials stop. We are still left with the question of what to do with the candidate pairs. Generally, we can proceed with two options:
- Double-check their actual Jaccard similarities by calculating their shingle overlap, due to the estimation nature of MinHash. The Jaccard similarity of two sets is defined as the size of the intersection divided by the size of the union, $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, and it is now much more tractable than computing all-pair similarities, because we can focus only on documents within a cluster. This is also what we initially did for BigCode, and it worked reasonably well (see the sketch after this list).
- Treat them as true positives. You probably already noticed the issue here: Jaccard similarity is not transitive, meaning $A$ can be similar to $B$ and $B$ similar to $C$, while $A$ and $C$ do not necessarily share that similarity. However, our experiments on The Stack showed that treating all of them as duplicates gives the best downstream model performance. We have gradually moved towards this approach instead, and it saves time as well. But to apply it to your own dataset, we still recommend going over your dataset, looking at your duplicates, and then making a data-driven decision.
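To make the first option concrete, here is a minimal false-positive check on the example shingle sets from earlier (illustrative only; the threshold is up to you):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

doc0 = {"Deduplication is so", "is so much", "so much fun"}
doc1 = {"so much fun", "fun and easy", "Deduplication is so", "is so much"}
print(jaccard(doc0, doc1))  # 0.75, so keep or discard the pair depending on your threshold
```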
From such pairs, whether they are validated or not, we can now construct a graph with those pairs as edges, and duplicates will be clustered into communities or connected components. In terms of implementation, unfortunately, this is where datasets couldn't help much, because now we need something like a groupby, where we can cluster documents based on their band offset and band values. Here are some options we have tried:

Option 1: Iterate the dataset the old-fashioned way and collect edges. Then use a graph library to do community detection or connected component detection.

This did not scale well in our tests, and the reasons are multifold. First, iterating the whole dataset is slow and memory-consuming at a large scale. Second, popular graph libraries like graphtool or networkx have a lot of overhead for graph creation.

Option 2: Use popular Python frameworks such as dask for more efficient groupby operations.

But then you still have the problems of slow iteration and slow graph creation.

Option 3: Iterate the dataset, but use a union-find data structure to cluster documents.

This adds negligible overhead to the iteration, and it works relatively well for medium-sized datasets.
```python
# HASH_TABLES holds one dict per band, mapping band values to the ids of documents
# that share them; uf is a union-find (disjoint set) data structure.
for table in tqdm(HASH_TABLES, dynamic_ncols=True, desc="Clustering..."):
    for cluster in table.values():
        if len(cluster) <= 1:
            continue
        idx = min(cluster)
        for x in cluster:
            uf.union(x, idx)
```
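For reference, `uf` above can be any standard union-find implementation; a minimal sketch with path compression (a simplified stand-in, not necessarily the exact one we used) could look like this:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path compression: point every node along the path directly at its root.
        if self.parent.setdefault(x, x) != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        # Merge the two clusters by attaching x's root to y's root.
        self.parent[self.find(x)] = self.find(y)

uf = UnionFind()
```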
Option 4: For large datasets, use Spark.

We already know that the steps up to and including the LSH part can be parallelized, which is also achievable in Spark. On top of that, Spark supports distributed groupBy out of the box, and it is also straightforward to implement algorithms like [18] for connected component detection. If you are wondering why we didn't use Spark's implementation of MinHash, the answer is that all our experiments so far stemmed from Datasketch, which uses a completely different implementation than Spark, and we want to make sure we carry on the lessons and insights learned from that without going down another rabbit hole of ablation experiments.
```python
edges = (
    records.flatMap(
        lambda x: generate_hash_values(
            content=x[1],
            idx=x[0],
            num_perm=args.num_perm,
            ngram_size=args.ngram_size,
            hashranges=HASH_RANGES,
            permutations=PERMUTATIONS,
        )
    )
    .groupBy(lambda x: (x[0], x[1]))
    .flatMap(lambda x: generate_edges([i[2] for i in x[1]]))
    .distinct()
    .cache()
)
```
A simple connected components algorithm based on [18], implemented in Spark:
```python
a = edges
while True:
    b = a.flatMap(large_star_map).groupByKey().flatMap(large_star_reduce).distinct().cache()
    a = b.map(small_star_map).groupByKey().flatMap(small_star_reduce).distinct().cache()
    changes = a.subtract(b).union(b.subtract(a)).collect()
    if len(changes) == 0:
        break

results = a.collect()
```
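The duplicate removal step then boils down to keeping one representative per connected component. As a rough sketch, assuming `results` holds `(node_id, component_id)` pairs where the component id is the smallest document id in the component, and `ds` is the original datasets dataset:

```python
import os

# Every document whose id differs from its component id is a duplicate of the representative.
duplicate_ids = {node for node, component in results if node != component}

deduped = ds.filter(
    lambda _, idx: idx not in duplicate_ids,
    with_indices=True,
    num_proc=os.cpu_count(),
    desc="Filtering duplicates...",
)
```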
Moreover, thanks to cloud providers, we can set up Spark clusters with ease using services like GCP Dataproc. In the end, we could comfortably run the program to deduplicate 1.4 TB of data in just under 4 hours, with a budget of $15 an hour.
Quality Matters
Climbing a ladder doesn't get us to the moon. That is why we need to make sure this is the right direction, and that we are using it the right way.

Early on, our parameters were largely inherited from the CodeParrot experiments, and our ablation experiment indicated that those settings did improve the model's downstream performance[16]. We then set out to explore this path further, and can confirm that[4]:
- Near-deduplication improves the model's downstream performance with a much smaller dataset (6 TB vs. 3 TB)
- We have not figured out the limit yet, but more aggressive deduplication (6 TB vs. 2.4 TB) can improve the performance even more, by:
  - Lowering the similarity threshold
  - Increasing the shingle size (unigram → 5-gram)
  - Dropping the false-positive check, because we can afford to lose a small percentage of false positives
Image: Two graphs showing the impact of similarity threshold and shingle size; the first one uses unigrams and the second 5-grams. The red dashed line shows the similarity cutoff: any documents below it are considered false positives, i.e., their similarities with other documents within the cluster are lower than the threshold.

These graphs help us understand why it was necessary to double-check the false positives for CodeParrot and the early version of the Stack: using unigrams creates many false positives. They also show that by increasing the shingle size to 5-grams, the percentage of false positives decreases significantly. A smaller threshold is desirable if we want to keep the deduplication aggressive.

Additional experiments also showed that lowering the threshold removes more documents that have high-similarity pairs, meaning increased recall in the segment we would actually most like to remove.
Scaling
This is not the most rigorous scaling proof you can find, but the deduplication time, given a fixed computation budget, looks practically linear in the physical size of the dataset. When you take a closer look at the cluster resource usage while processing the JSON dataset, the largest subset in the Stack, you can see that MinHash + LSH (stage 2) dominated the total real computation time (stage 2 + 3), which, from our previous analysis, is $\mathcal{O}(NM)$, i.e., linear in the dataset's physical volume.
Proceed with Caution
Deduplication does not exempt you from thorough data exploration and analysis. In addition, these deduplication findings hold for the Stack, but that does not mean they are readily applicable to other datasets or languages. It is a good first step towards building a better dataset, and further investigations such as data quality filtering (e.g., vulnerability, toxicity, bias, generated templates, PII) are still much needed.

We still encourage you to perform a similar analysis on your own datasets before training. For example, it might not be very helpful to do deduplication if you have a tight time and compute budget: @geiping_2022 mention that substring deduplication did not improve their model's downstream performance. Existing datasets may also require thorough examination before use; for example, @gao_2020 state that they only made sure the Pile itself, along with its splits, was deduplicated, and that they would not proactively deduplicate against any downstream benchmarks, leaving that decision to readers.

In terms of data leakage and benchmark contamination, there is still much to explore. We had to retrain our code models because HumanEval was published in one of the GitHub repos in Python. Early near-deduplication results also suggest that MBPP[19], one of the most popular benchmarks for coding, shares a lot of similarity with many Leetcode problems (e.g., task 601 in MBPP is basically Leetcode 646, task 604 ≃ Leetcode 151). And we all know GitHub is not short of such coding challenges and solutions. It will be even harder if someone with bad intentions uploads all the benchmarks in the form of Python scripts, or in other less obvious ways, and pollutes all your training data.
Future Directions
- Substring deduplication. Although it showed some benefits for English[3], it is not clear yet whether it should be applied to code data as well;
- Repetition: paragraphs that are repeated multiple times within a single document. @rae_2021 shared some interesting heuristics on how to detect and remove them.
- Using model embeddings for semantic deduplication. This is a whole other research question involving scaling, cost, ablation experiments, and the trade-off with near-deduplication. There are some intriguing takes on this[20], but we still need more concrete evidence to draw a conclusion (e.g., @abbas_2023's only text deduplication reference is @lee_2022a, whose main claim is that deduplication helps, rather than trying to be SOTA).
- Optimization. There is always room for optimization: better quality evaluation, scaling, downstream performance impact analysis, etc.
- Then there is another way to look at things: at what point does near-deduplication start to harm performance? To what extent is similarity needed for diversity rather than being considered redundancy?
Credits
The banner image contains emojis (hugging face, Santa, document, wizard, and wand) from Noto Emoji (Apache 2.0). This blog post is proudly written without any generative APIs.

Huge thanks to Huu Nguyen @Huu and Hugo Laurençon @HugoLaurencon for the collaboration in BigScience, and to everyone at BigCode for their help along the way! If you ever find any errors, feel free to contact me: mouchenghao at gmail dot com.
Supporting Resources
References
- [1] : Nikhil Kandpal, Eric Wallace, Colin Raffel, Deduplicating Training Data Mitigates Privacy Risks in Language Models, 2022
- [2] : Gowthami Somepalli, et al., Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models, 2022
- [3] : Katherine Lee, Daphne Ippolito, et al., Deduplicating Training Data Makes Language Models Better, 2022
- [4] : Loubna Ben Allal, Raymond Li, et al., SantaCoder: don't reach for the stars!, 2023
- [5] : Leo Gao, Stella Biderman, et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020
- [6] : Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, et al., MarIA: Spanish Language Models, 2022
- [7] : Jack W. Rae, Sebastian Borgeaud, et al., Scaling Language Models: Methods, Evaluation & Insights from Training Gopher, 2021
- [8] : Xi Victoria Lin, Todor Mihaylov, et al., Few-shot Learning with Multilingual Language Models, 2021
- [9] : Hugo Laurençon, Lucile Saulnier, et al., The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset, 2022
- [10] : Daniel Fried, Armen Aghajanyan, et al., InCoder: A Generative Model for Code Infilling and Synthesis, 2022
- [11] : Erik Nijkamp, Bo Pang, et al., CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis, 2023
- [12] : Yujia Li, David Choi, et al., Competition-Level Code Generation with AlphaCode, 2022
- [13] : Frank F. Xu, Uri Alon, et al., A Systematic Evaluation of Large Language Models of Code, 2022
- [14] : Aakanksha Chowdhery, Sharan Narang, et al., PaLM: Scaling Language Modeling with Pathways, 2022
- [15] : Lewis Tunstall, Leandro von Werra, Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, 2022
- [16] : Denis Kocetkov, Raymond Li, et al., The Stack: 3 TB of permissively licensed source code, 2022
- [17] : Rocky | Project Hail Mary Wiki | Fandom
- [18] : Raimondas Kiveris, Silvio Lattanzi, et al., Connected Components in MapReduce and Beyond, 2014
- [19] : Jacob Austin, Augustus Odena, et al., Program Synthesis with Large Language Models, 2021
- [20]: Amro Abbas, Kushal Tirumala, et al., SemDeDup: Data-efficient learning at web-scale through semantic deduplication, 2023
- [21]: Edith Cohen, MinHash Sketches : A Temporary Survey, 2016




